Technology Review - Published By MIT

August 2005

From the Lab: Information Technology

By Technology Review

Smarter Search
Streamlining retrieval on the Web

Results: Some Web-search sites like Clusty and Teoma sort results into categories to help users narrow their searches. Researchers at IBM have devised an algorithm that allows search programs to display a wider selection of categories by analyzing the content of a sample of results rather than that of every page. The researchers performed searches of 1.8 million Web pages, analyzing both the entire body of results and the sample populations selected by the algorithm. They found that even when samples constituted only 1 percent of the total results, the algorithm could still capture most of the popular categories extracted from all the results.

Why it Matters: Looking for information online can be frustrating when search terms have multiple meanings and contexts. Sorting results into "clusters" of related topics can help cut search times, but most search engines that use this technique examine only the most relevant few hundred results to extract common themes. So even topics with plenty of pages devoted to them can be ignored in favor of trendier subjects associated with the same keywords: a search for "macintosh" will identify themes prominent on millions of computer-gossip pages but entirely miss those few thousand pages about Charles Macintosh, father of the rubberized raincoat. The sampling methods devised by Aris Anagnostopoulos, now at Brown University, and Andrei Broder and David Carmel at IBM could allow users to quickly find the pages they want, even when their search terms are ambiguous.

Methods: In a large search, collecting a representative sample is not easy. Most search engines assemble results not all at once, but a handful at a time as needed. They first generate a list of matching pages for each keyword in a query. Those lists are merged, about a hundred results at a time, using logical operators extracted from the query -- words such as "and" and "or." The IBM algorithm, on the other hand, simultaneously sifts through these multiple lists, picking Web pages at random and, if they meet all the conditions of the search, adding them to the sample pool. The algorithm takes measures to ensure that each Web page in a list has an equal probability of being chosen. A search engine could use the sample pool to determine sorting themes.
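The idea of drawing random pages and keeping only those that satisfy the query can be illustrated with a short Python sketch. This is not the IBM researchers' algorithm, whose details are not given here. It is a plain rejection-sampling illustration for a query that joins keywords with "and," and it assumes each keyword's list is an in-memory Python list of unique page IDs. The function names, the page IDs, and the category labels are all hypothetical.

```python
import random
from collections import Counter


def sample_and_query(posting_lists, sample_size, max_tries=100_000, rng=None):
    """Sample pages that appear in every keyword list (an "and" query).

    Illustrative rejection sampling, not the published algorithm: draw a
    random page from the shortest list and keep it only if every other
    list contains it too. Each matching page is equally likely on every
    draw, so the sample is uniform over the matches (with replacement).
    """
    rng = rng or random.Random()
    shortest = min(posting_lists, key=len)
    if not shortest:
        return []  # one keyword matches nothing, so the query matches nothing
    others = [set(lst) for lst in posting_lists if lst is not shortest]
    sample, tries = [], 0
    while len(sample) < sample_size and tries < max_tries:
        tries += 1
        page = rng.choice(shortest)
        if all(page in other for other in others):
            sample.append(page)
    return sample


def top_categories(sample, category_of, k=3):
    """Return the k most common categories among the sampled pages."""
    counts = Counter(category_of[page] for page in sample)
    return [category for category, _ in counts.most_common(k)]


if __name__ == "__main__":
    # Made-up toy index: page IDs per keyword and a category per page.
    apple_pages = list(range(0, 1000))
    macintosh_pages = list(range(500, 3000))
    category_of = {p: "computers" if p % 10 else "raincoats" for p in range(3000)}

    pool = sample_and_query([apple_pages, macintosh_pages], sample_size=50)
    print(top_categories(pool, category_of))
```

Because the sketch starts from the shortest list, fewer random draws are wasted on pages that fail the other conditions. A production search engine would work against compressed on-disk indexes and would also need to handle "or" queries, which this sketch does not attempt.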

Next Step: Devising custom sampling techniques to handle the most common types of queries could yield speedier search results. Anagnostopoulos is also interested in investigating whether, when devising sorting categories, giving less popular pages even more weight leads to better results. -- By Dan Cho

Source: Anagnostopoulos, A., et al. 2005. Sampling search-engine results. Paper presented at the 14th International World Wide Web Conference. May 10-14. Chiba, Japan.
