Streamlining retrieval on the Web
Results: Some Web-search sites, such as Clusty and Teoma, sort results into categories to help users narrow their searches. Researchers at IBM have devised an algorithm that allows search programs to display a wider selection of categories by analyzing the content of a sample of results rather than that of every page. The researchers performed searches of 1.8 million Web pages, analyzing both the entire body of results and the sample populations selected by the algorithm. They found that even when samples constituted only 1 percent of the total results, the algorithm could still capture most of the popular categories extracted from all the results.
Why it Matters: Looking for information online can be frustrating when search terms have multiple meanings and contexts. Sorting results into “clusters” of related topics can help cut search times, but most search engines that use this technique examine only the few hundred most relevant results to extract common themes. So even topics with plenty of pages devoted to them can be ignored in favor of trendier subjects associated with the same keywords: a search for “macintosh” will identify themes prominent on millions of computer-gossip pages but entirely miss the few thousand pages about Charles Macintosh, father of the rubberized raincoat. The sampling methods devised by Aris Anagnostopoulos, now at Brown University, and Andrei Broder and David Carmel at IBM could allow users to quickly find the pages they want, even when their search terms are ambiguous.
Methods: In a large search, collecting a representative sample is not easy. Most search engines assemble results not all at once, but a handful at a time as needed. They first generate a list of matching pages for each keyword in a query. Those lists are merged, about a hundred results at a time, using logical operators extracted from the query – words such as “and” and “or.” The IBM algorithm, on the other hand, simultaneously sifts through these multiple lists, picking Web pages at random and, if they meet all the conditions of the search, adding them to the sample pool. The algorithm takes measures to ensure that each Web page in a list has an equal probability of being chosen. A search engine could use the sample pool to determine sorting themes.
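The sampling idea described above can be illustrated with a short Python sketch. It covers only the simplest case, a query whose keywords are all joined by “and,” and it is not the IBM researchers' algorithm. The function name `sample_and_query`, the in-memory lists of page IDs, and the `max_tries` cutoff are all assumptions made for the example. The sketch picks pages at random from the shortest keyword list and keeps a page only if it also appears in every other list.

```python
import random

def sample_and_query(posting_lists, sample_size, max_tries=100_000, rng=None):
    """Rejection-sample page IDs that appear in every posting list.

    Hypothetical helper for illustration: each posting list is assumed to be
    a list of unique page IDs matching one keyword of an "and" query.
    """
    rng = rng or random.Random()
    # Every page matching the whole query must appear in the shortest list,
    # so candidates are drawn from it.
    shortest = min(posting_lists, key=len)
    # Sets stand in for an index lookup; a real engine would probe its index
    # instead of loading each list into memory.
    others = [set(p) for p in posting_lists if p is not shortest]
    sample = []
    for _ in range(max_tries):
        if len(sample) >= sample_size:
            break
        candidate = rng.choice(shortest)  # uniform over the shortest list
        if all(candidate in s for s in others):
            sample.append(candidate)  # drawn with replacement, so repeats can occur
    return sample

# Made-up page IDs: 1,000 pages mention "macintosh", 300 mention "raincoat",
# and the 100 pages numbered 900-999 mention both.
macintosh = list(range(0, 1000))
raincoat = list(range(900, 1200))
pool = sample_and_query([macintosh, raincoat], sample_size=20, rng=random.Random(0))
```

Because each candidate is drawn uniformly from one list and accepted only if it satisfies the whole query, every matching page has the same chance of entering the pool. Queries that use “or” would need a different acceptance rule, which this sketch does not attempt.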
Next Step: Devising custom sampling techniques to handle the most common types of queries could yield speedier search results. Anagnostopoulos is also interested in investigating whether, when devising sorting categories, giving less popular pages even more weight leads to better results. – By Dan Cho
Source: Anagnostopoulos, A., et al. 2005. Sampling search-engine results. Paper presented at the 14th International World Wide Web Conference. May 10-14. Chiba, Japan.