Surprising Search Patterns

A new study questions the common assumption that search engines control the hierarchy of the Internet.

Kate Greene

August 18, 2006

Conventional wisdom says that search engines are a fundamentally unfair technology – favoring the most popular sites and helping them to become even more popular. This assumption, captured in the term “Googlearchy,” is now being challenged by researchers at Indiana University who have used real-life data to test it. Their results show that Web-surfing behavior isn’t as influenced by search-engine rankings as was previously thought.

This plot of red splotches, which represent popular websites, illustrates the idea that websites at the top of search engine results become increasingly popular, a concept known as “Googlearchy.” (Credit: Filippo Menczer and the trustees of Indiana University)

Understanding the impact of search engines isn’t just an academic undertaking, says Filippo Menczer, professor of informatics and computer science at Indiana University in Bloomington. It has implications for creating online advertising models based on search results, building better search engines, devising online political campaigns, and understanding how people use the Internet. “Search engines have become the gateways between people and information,” he says. “If a search engine has a bias, it has a huge impact because it can direct people to one sort of information and not another.”

Search engines rank and list pages by popularity, a feature measured, in part, by how well-connected a page is to the rest of the Web. The more pages linking to a certain page, the higher that page will rank. Since these highly ranked sites are easier to find through a search, they will continue to get more hits. “The more popular sites get more and more links and new sites have no hope,” says Menczer.
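The link-counting idea in the previous paragraph can be illustrated with a minimal sketch. It is not the algorithm any real search engine uses, which the article does not detail. It simply assumes that a page's rank depends only on how many other pages link to it. The toy link graph, the page names, and the function rank_by_inlinks are all hypothetical.

```python
from collections import Counter

# Hypothetical toy link graph: each page maps to the pages it links to.
links = {
    "a.example": ["hub.example", "b.example"],
    "b.example": ["hub.example"],
    "c.example": ["hub.example", "a.example"],
    "hub.example": ["a.example"],
    "new.example": ["hub.example"],
}

def rank_by_inlinks(link_graph):
    """Return (page, in-link count) pairs, most-linked page first.

    Assumption: popularity is just the number of distinct other pages
    linking to a page. Ties are broken alphabetically by page name.
    """
    # Start every known page at zero so unlinked pages still appear.
    inlinks = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):  # count each linking page once
            if target != source:     # ignore self-links
                inlinks[target] += 1
    return sorted(inlinks.items(), key=lambda item: (-item[1], item[0]))

ranking = rank_by_inlinks(links)
for page, count in ranking:
    print(page, count)
```

In this toy graph the heavily linked hub.example lands at the top, while new.example, which no page links to yet, lands at the bottom. That is the starting position from which, under the Googlearchy assumption, a new site would struggle to climb.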

The researchers created two extreme Web-browsing models: a person who used only search engines to find content and a person who browsed without search engines, instead following links from one page to another. The researchers then compared these two models with real-life data about site traffic for Web pages and the number of links pointing to those pages.

They expected the real-world data to fall somewhere between the two extremes: targeted searching and haphazard surfing. Instead, it turned out that typical Web use – presumably a combination of searching and surfing – concentrated less on popular Web sites than either model had predicted. In other words, real-world Web searching does not fuel the Googlearchy, nor does it keep less-popular sites from being found. “This was not what we expected and we were surprised by it,” says Menczer.

The explanation appears to be fairly simple: more and more people are searching for more specific information. If someone submits a general query, say, “bird flu,” the top of a search engine’s results page will indeed list high-traffic websites, for example, the Centers for Disease Control site. And that site’s popularity will be reinforced. But Web searches are becoming increasingly complex, according to Menczer. A search for “bird flu Turkey 2005” will bring up far fewer results, and lead to more obscure pages. “If you consider that people submit diverse queries that return a small number of hits,” he says, “that means traffic is distributed to less-popular sites.”

The results are somewhat controversial because many people have been operating under the assumption that a Googlearchy does exist, says Albert-László Barabási, professor of physics at the University of Notre Dame and also an expert on Internet behavior and how websites are connected to each other. He agrees with Menczer that general searches do make some types of sites more popular. “I think the message here is that as soon as you become a slightly more sophisticated searcher, then you’re breaking the spell of the Web,” he says.

The theory that people are becoming more adept in searching the Web is borne out by some hard data, too. According to Hitwise, a firm that tries to improve companies’ search rankings, people are increasingly using more words per search query. Based on this trend, Menczer’s research seems reasonable, says Bill Tancer, general manager of global research at Hitwise.

But Tancer also questions the quality of data used to test the researchers’ models. For example, the traffic data for the research was gleaned from a free, downloadable search tool, Alexa, which provides Web statistics. But, according to Tancer, this data could be biased because Alexa users tend to be online marketers rather than average Web users.

In addition, the study used data from 2003, and “a lot has changed since then,” says Tancer. Hitwise data, which is collected directly from Internet service providers such as AT&T, suggests that people interact with the Web in a number of ways, not just by using search engines or surfing. Tancer says people also end up on sites by typing in a URL directly, through sponsored links (where companies pay to appear prominently on a search page), and through social networking sites.

Indiana’s Menczer says that the paper, released last week in the Proceedings of the National Academy of Sciences, is a first attempt to show how Web data may or may not corroborate the idea of a Googlearchy. Currently, his group is exploring the effects of other modes of Web use, including social search, to see if sites such as digg.com and del.icio.us amplify or diminish his team’s results.

Meanwhile, the Indiana researchers’ work provides an important analysis of a commonly held assumption about search engines, says Matt Hindman, professor of political science at Arizona State University in Phoenix. Using “empirical data to model these relationships rather than just assume” is what had been missing, he says.
