How Google Ranks Tweets

Algorithms judge the relevance of microblog posts containing 140 characters or fewer.

David Talbot

January 13, 2010

To deliver useful search returns from the so-called real-time Web–such as seconds-old Twitter “tweets” reporting traffic jams–Google has adapted its page-ranking technology and developed new algorithmic tricks and filters to keep returns relevant, according to a leading Google engineer.

Google rolled out real-time search technology last month, to offer searchers access to brand-new blog posts and news items far faster than the five to 15 minutes it previously took Google’s Web crawlers to discover newly created items.

Bing, Cuil, and other search engines also provide various kinds of real-time results. Both Google and Bing have also forged major deals with Twitter to get real-time access to tweets, those 140-character microblog posts sent out by Twitter members. But Google claims to offer the most comprehensive real-time results by scanning news headlines, blogs, and feeds from Facebook, MySpace, Twitter, and other sources.

The tweets are a mainstay of Google’s real-time results, but Google has not previously discussed how it ranks them. A fundamental Google strategy for identifying tweet relevance is analogous to that used by Google’s PageRank technology, which helps find relevant Web pages with traditional Web search. Under PageRank, Google judges the importance of pages containing a given search keyword in part by looking at the pages’ link structure. The more pages that link to a page–and the more pages linking to the linkers–the more relevant the original page.

In the case of tweets, the key is to identify “reputed followers,” says Amit Singhal, a Google Fellow, who led development of real-time search. (Twitterers “follow” the comments of other Twitterers they’ve selected, and are themselves “followed.”)

“You earn reputation, and then you give reputation. If lots of people follow you, and then you follow someone–then even though this [new person] does not have lots of followers,” his tweet is deemed valuable because his followers are themselves followed widely, Singhal says. It is “definitely, definitely” more than a popularity contest, he adds.

“One user following another in social media is analogous to one page linking to another on the Web. Both are a form of recommendation,” Singhal says. “As high-quality pages link to another page on the Web, the quality of the linked-to page goes up. Likewise, in social media, as established users follow another user, the quality of the followed user goes up as well.”
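The analogy Singhal draws can be illustrated with a small PageRank-style iteration over a follower graph. This is only a sketch of the general idea, not Google's actual ranking method, which the article does not disclose. The function name `follower_rank`, the damping factor of 0.85, and the sample accounts are all assumptions made for illustration.

```python
def follower_rank(follows, damping=0.85, iterations=50):
    """Compute PageRank-style reputation scores on a follower graph.

    `follows` maps each user to the list of users they follow.
    A follow is treated like a link: the follower passes a share of
    their own reputation to everyone they follow.
    """
    # Collect every user, including those who only appear as "followed".
    users = set(follows)
    for followed in follows.values():
        users.update(followed)
    users = sorted(users)
    n = len(users)

    # Start with reputation spread evenly.
    rank = {u: 1.0 / n for u in users}

    for _ in range(iterations):
        # Every user receives a small baseline share.
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            followed = follows.get(u, [])
            if followed:
                # Split this user's reputation among the users they follow.
                share = damping * rank[u] / len(followed)
                for v in followed:
                    new_rank[v] += share
            else:
                # A user who follows no one spreads reputation evenly.
                share = damping * rank[u] / n
                for v in users:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical accounts: "newcomer" has one follower, but that follower
# ("star") is widely followed. "spammer" has two followers nobody follows.
follows = {
    "fan1": ["star"], "fan2": ["star"], "fan3": ["star"], "fan4": ["star"],
    "star": ["newcomer"],
    "bot1": ["spammer"], "bot2": ["spammer"],
}
scores = follower_rank(follows)
for user in sorted(scores, key=scores.get, reverse=True):
    print(f"{user:10s} {scores[user]:.3f}")
```

In this toy graph the newcomer outranks the spammer despite having fewer followers, which is the sense in which such a scheme is more than a popularity contest.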

But Google’s social-ranking tricks are hardly the only method the search giant uses to extract relevance from tweets. Google also developed new ways to choose which (if any) tweets to surface for common terms like “Obama”–and to avoid spam or low-quality tweets–all within seconds.

One problem with tweets is that people often lard them up with so-called “hashtags.” These are tags that consist of a pound sign (#) followed by a word or phrase that represents a very popular current topic, such as “Nexus One” or “Earthquake” or whatever else might be trendy at the moment. When a hashtag is included in a tweet, the resulting tweet will show up when other Twitterers click the hashtag’s topic word elsewhere on the site.

While such tags can usefully maximize a tweet’s exposure, they can also be red flags for low-quality tweets and can attract spam-like content, Singhal says. While he wouldn’t get into details, he said Google modeled this hashtagging behavior in ways that tend to reduce the exposure of low-quality tweets. “We needed to model that [hashtagging] behavior. That is the technical challenge which we went after with our modeling approaches,” Singhal says.

Another problem is how, when someone searches for “Obama,” to sift through White House press tweets and thousands of others to find the most timely and topical information. Google scans tweets to find the “signal in the noise,” Singhal says. Such a “signal” might include a new onslaught of tweets and blog posts that mention “Cambridge police” or “Harry Reid” near mentions of “Obama.” By looking out for such signals, Google is able to furnish real-time hits that contain the freshest subject matter even for very common search terms.
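One simple way to picture this kind of signal is to compare how often a term appears alongside the query in very recent posts against how often it appeared in an earlier baseline window. The sketch below is an assumption-laden illustration, not a description of Google's system: the function names, the thresholds `min_count` and `min_ratio`, the whitespace tokenization, and the sample posts are all hypothetical.

```python
from collections import Counter


def cooccurrence_counts(posts, query):
    """Count words appearing in posts that mention `query`.

    Returns the counts and the number of posts that matched.
    """
    counts = Counter()
    matched = 0
    q = query.lower()
    for text in posts:
        words = set(text.lower().split())  # naive whitespace tokenization
        if q in words:
            matched += 1
            counts.update(words - {q})
    return counts, matched


def bursting_terms(recent_posts, baseline_posts, query,
                   min_count=2, min_ratio=3.0):
    """Return terms that co-occur with `query` far more often in the
    recent window than in the baseline window."""
    recent, n_recent = cooccurrence_counts(recent_posts, query)
    baseline, n_base = cooccurrence_counts(baseline_posts, query)
    bursts = []
    for term, count in recent.items():
        if count < min_count:
            continue  # ignore one-off words
        recent_rate = count / n_recent
        # Add-one smoothing so unseen baseline terms do not divide by zero.
        baseline_rate = (baseline[term] + 1) / (n_base + 1)
        if recent_rate / baseline_rate >= min_ratio:
            bursts.append(term)
    return sorted(bursts)


# Hypothetical sample posts, invented purely for illustration.
baseline_posts = [
    "obama speech on economy today",
    "obama meets senate leaders",
    "white house says obama will travel",
    "obama signs bill",
    "new poll on obama approval",
    "obama press briefing today",
]
recent_posts = [
    "obama comments on reid remarks",
    "reid apologizes to obama",
    "obama accepts reid apology",
    "obama press briefing today",
]
bursts = bursting_terms(recent_posts, baseline_posts, "obama")
print(bursts)
```

Posts containing a bursting term could then be favored when choosing which fresh results to show for a common query such as “Obama.”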

In the future, both Twitter and Google hope to improve the relevance of search returns in all contexts by adding geo-location data, which can be added to postings sent from smart phones. In general, real-time search “is evolving,” says Dylan Casey, the Google product manager for real-time search. “I talk with the guys at Twitter on a regular basis to learn where the feature is going. We get feedback from them, we give them feedback, and our engineers collaborate. It is truly symbiotic.”

Singhal added that Twitter is hardly the only source of real-time information. “Twitter is indeed a very important component of the real-time Web. However, what we are observing is that it is just one of the components. There’s a lot of value in news, blogs, and Web pages that are being generated in real-time, because news organizations work very hard to get quality to a certain level,” he says. “Twitter is indeed useful because it is short-form content. However, we are finding that the real-time Web is much bigger.”