MIT Technology Review

Smarter Search
Streamlining retrieval on the Web

Results: Some Web-search sites like Clusty and Teoma sort results into categories to help users narrow their searches. Researchers at IBM have devised an algorithm that allows search programs to display a wider selection of categories by analyzing the content of a sample of results rather than that of every page. The researchers performed searches of 1.8 million Web pages, analyzing both the entire body of results and the sample populations selected by the algorithm. They found that even when samples constituted only 1 percent of the total results, the algorithm could still capture most of the popular categories extracted from all the results.
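To make the 1 percent claim concrete, here is a minimal Python sketch that compares the most common categories in a full result set with those in a uniform 1 percent sample. Everything in it is an assumption made for illustration: the category names, their proportions, the result count, and the `top_categories` helper are invented and do not come from the researchers' experiments.

```python
import random
from collections import Counter

def top_categories(labels, k):
    """Return the set of the k most frequent category labels."""
    return {label for label, _ in Counter(labels).most_common(k)}

# Synthetic, made-up data: 100,000 results with a skewed category mix.
rng = random.Random(0)
categories = ["computers", "music", "apples", "raincoats", "history"]
weights = [0.55, 0.25, 0.12, 0.05, 0.03]
results = rng.choices(categories, weights=weights, k=100_000)

# Uniform sample containing 1 percent of the results.
sample = rng.sample(results, k=len(results) // 100)

full_top = top_categories(results, 3)
sample_top = top_categories(sample, 3)

# Fraction of the full set's top categories that the sample also found.
overlap = len(full_top & sample_top) / len(full_top)
print(sorted(full_top), sorted(sample_top), overlap)
```

With a mix this skewed, the sample is very likely to recover the same leading categories as the full set, because the gaps between category frequencies are large compared with the sampling noise at 1,000 draws.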

Why it Matters: Looking for information online can be frustrating when search terms have multiple meanings and contexts. Sorting results into “clusters” of related topics can help cut search times, but most search engines that use this technique examine only the most relevant few hundred results to extract common themes. So even topics with plenty of pages devoted to them can be ignored in favor of trendier subjects associated with the same keywords: a search for “macintosh” will identify themes prominent on millions of computer-gossip pages but entirely miss those few thousand pages about Charles Macintosh, father of the rubberized raincoat. The sampling methods devised by Aris Anagnostopoulos, now at Brown University, and Andrei Broder and David Carmel at IBM could allow users to quickly find the pages they want, even when their search terms are ambiguous.

Methods: In a large search, collecting a representative sample is not easy. Most search engines assemble results not all at once, but a handful at a time as needed. They first generate a list of matching pages for each keyword in a query. Those lists are merged, about a hundred results at a time, using logical operators extracted from the query – words such as “and” and “or.” The IBM algorithm, on the other hand, simultaneously sifts through these multiple lists, picking Web pages at random and, if they meet all the conditions of the search, adding them to the sample pool. The algorithm takes measures to ensure that each Web page in a list has an equal probability of being chosen. A search engine could use the sample pool to determine sorting themes.
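The idea of picking pages at random and keeping only those that satisfy the whole query can be sketched with simple rejection sampling. This is a simplified illustration for an "and" query, not the researchers' algorithm: it assumes each keyword's list fits in memory as a Python list of page IDs with no duplicates, and the function name `sample_and_query`, its parameters, and the page IDs below are all hypothetical.

```python
import random

def sample_and_query(posting_lists, sample_size, rng, max_tries=100_000):
    """Uniformly sample page IDs that appear in every posting list.

    Draw a random page from the shortest list and keep it only if all
    the other lists contain it too (rejection sampling).
    """
    shortest = min(posting_lists, key=len)
    if not shortest:
        return set()  # an empty list means no page can match every keyword
    others = [set(pl) for pl in posting_lists if pl is not shortest]
    sample = set()
    for _ in range(max_tries):
        if len(sample) >= sample_size:
            break
        page = rng.choice(shortest)  # every page in the list is equally likely
        if all(page in other for other in others):
            sample.add(page)  # page meets all conditions of the query
    return sample

# Hypothetical page IDs for two keywords; matches are the multiples of 6.
rng = random.Random(42)
keyword_a = list(range(0, 1000, 2))
keyword_b = list(range(0, 1000, 3))
pool = sample_and_query([keyword_a, keyword_b], sample_size=10, rng=rng)
print(sorted(pool))
```

Because every draw is uniform over the shortest list and a page is kept only when it appears in all the lists, each matching page has the same chance of entering the pool. Drawing from the shortest list keeps the number of rejected picks down.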

Next Step: Devising custom sampling techniques to handle the most common types of queries could yield speedier search results. Anagnostopoulos is also interested in investigating whether, when devising sorting categories, giving less popular pages even more weight leads to better results. – By Dan Cho

Source: Anagnostopoulos, A., et al. 2005. Sampling search-engine results. Paper presented at the 14th International World Wide Web Conference. May 10-14. Chiba, Japan.


Tagged: Computing
