By nature chaotic and decentralized, the Web screams out for tools to help people hunt down documents no matter where they reside. Say you want information on treatments for scurvy in the 18th century: without a search engine, you have no way of knowing that the data you need is stored only in places like a cryptically named file (www.jameslindlibrary.org/trial_records/17th_18th_Century/lind/lind_kp.html) on a server at the library of the Royal College of Physicians in Edinburgh, Scotland.
When you type “scurvy” into a search box at Google or MSN or Ask Jeeves, however, you’re still not touching the actual file at the Royal College. You’re merely rifling through the search company’s index of the Web, a huge list assembled by software “spiders” that crawl through thousands of pages every second, copying keywords, phrases, titles and subtitles, links, and other descriptive information. Once a fragment of information lands in the index, it’s usually compressed, assigned a “weight” or importance, and stored in a database for quick retrieval. The search terms you enter are compared against this index, and links to pages that contain one or more of your terms are displayed in order of relevance.
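The index lookup described above can be sketched in a few lines of Python. This is only an illustrative toy, not any search company's implementation: the page URLs and text are invented, the function names `tokenize`, `build_index`, and `search` are hypothetical, and the "weight" is assumed to be a simple count of how often a term appears on a page.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each term to {url: weight}; the weight here is just the term count."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Return URLs containing at least one query term, highest total weight first."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for url, weight in index.get(term, {}).items():
            scores[url] += weight
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical crawled pages (URLs and text are made up for illustration).
pages = {
    "example.org/lind": "James Lind tested scurvy treatments; citrus cured scurvy",
    "example.org/navy": "Navy rations and scurvy in the 18th century",
    "example.org/stamps": "Stamp collecting for beginners",
}
index = build_index(pages)
results = search(index, "scurvy treatments")
```

Note that the query never touches the pages themselves; it only consults the precomputed index, which is why the lookup is fast.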
How a search engine determines that relevance is the secret sauce. Google rocketed to prominence in 1999 largely due to PageRank, an algorithm invented by founders Larry Page and Sergey Brin that was the first to capitalize on the massive interlinking of Web pages. Each link is, in effect, a vote made by the author of one page for the contents of another. Page and Brin realized that if their index were big enough, they’d be able to assess a page’s importance by counting the number of other pages that linked to it. They took other factors into account as well, such as the pertinence of the text surrounding the links and the linking pages’ own popularity. But their groundbreaking insight was that the Web is a giant popularity contest, and that the most-cited pages will probably be the most useful. The technique worked fiendishly well, and Web users voted with their clicks. Between June 2000 and January 2004, former top dog AltaVista, which ranked results largely according to the number of times a page mentioned the user’s search keywords, dropped from eighth place in overall Web traffic rankings to 61st, while Google climbed from near-invisibility to fourth place, according to data from research firms Media Metrix and Alexa. Google has so pervaded the Web that its very name was selected by the American Dialect Society as the most useful new word of 2002.
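To make the "links as votes" idea concrete, here is a minimal sketch of the link-counting calculation in Python. It follows the simplified iterative formulation from Page and Brin's published description, not Google's production system; the four-page link graph is invented, and the damping value of 0.85 and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance; links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: a, b, and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
top = max(ranks, key=ranks.get)
```

In this toy graph, page `c` comes out on top because three pages link to it. Page `a` outranks `b` and `d` even though it has only one inbound link, because that link comes from the popular page `c`, which mirrors the point about the linking pages' own popularity.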
Despite its advantages, PageRank has a few flaws. Just as earlier search engines could be fooled by pages peppered with thousands of keywords in “invisible” white-on-white type, an unscrupulous site owner who wants his Web address to appear higher in Google’s search results can easily publish thousands or even millions of junk pages that contain links to his site, artificially raising its rank. (Google says it has ways of counteracting such attacks, but won’t discuss them.) The same loophole in PageRank allows “Google bombing,” a recent phenomenon in which bloggers make a humorous or political point by creating so many links to a given site that it comes up first when users type a specific term into the Google search box. Google bombers protesting the war in Iraq, for instance, managed to make George W. Bush’s White House biography the first-ranked result under “miserable failure.”
More bothersome to some critics, however, is PageRank’s obsession with fame. A legitimate page that matches a Google user’s search terms perfectly may get buried in search results simply because there aren’t enough other pages pointing to it, notes Daniel Brandt, a Web developer who runs a critical site called Google Watch. A page’s relevance to an individual user, Brandt and other critics argue, may depend on more than its popularity. “Just because the rest of the planet thinks that this is the number one travel site doesn’t mean it is the number one travel site for you,” says Liesl Capper, founder and CEO of Sydney-based upstart Mooter, who believes she just might have a better way.
Putting a Stamp on Search Results
Enter the same search term into ten different search engines, and you’re likely to get ten conflicting sets of results. That’s partly because the search companies’ spiders crawl different subsets of the Web; but more importantly, it’s a reflection of the unique principles at work in each company’s ranking algorithms. Here’s how three search engines handle the term “stamp collecting.”
|1.||American Philatelic Society||Coin and Stamp Collecting (About.com)||Stamp Link-Philately, Stamp Collecting’s Best Site in Its Category|
|2.||Joseph Luft’s Philatelic Resources on the Web||Joseph Luft’s Philatelic Resources on the Web||Warragul Philatelic Society (Stamp Collecting)|
|3.||Linns.com: The website of the world’s largest weekly stamp newspaper-Linn’s Stamp News||American Philatelic Society||Postal history, philatelic covers and stamps for sale|
|4.||Stamp Link-Philately, Stamp Collecting’s Best Site in Its Category||BNAPS Stamp Collecting for Kids||Great Britain Philatelic Society|
|5.||Philatelic.Com||Linns.com: The website of the world’s largest weekly stamp newspaper-Linn’s Stamp News||Stamp Collecting, Philately, Stamp Auction|
|The top-ranked page has the highest “authority” (essentially, the most links) among communities of Web sites about stamp collecting. It’s validated by references from resource pages (link collections from experts and enthusiasts, in this case stamp collectors) and link popularity measurements similar to Google’s. The runner-up, Joseph Luft’s Philatelic Resources, has fewer and less qualified referrals from experts on the subject.||Google officials won’t discuss how the Google engine arrives at rankings for specific sites. Patent documents and published papers, however, show that Google ranks pages according to how often other pages link to them. Google also takes into account such factors as the relevance of the referring pages and the text surrounding the links. Presumably, collectstamps.about.com is the most cited page on this subject in Google’s index.||Mooter first groups results into clusters or themes. The pages shown above appear in the “philatelic” cluster, ranked according to how frequently the search keywords and the cluster name appear on each page, among other factors. Mooter “learns” the user’s intent by noting which clusters and pages are clicked on, and reranks the results to reflect the apparent pattern of interest.|
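The cluster-and-rerank behavior attributed to Mooter in the table can be roughly illustrated as follows. This is a guess at the general shape of such a scheme, not Mooter's actual algorithm, which the table describes only in outline. The function names `score` and `rerank`, the cluster contents, and the rule that more-clicked clusters simply move to the front are all assumptions made for the sketch.

```python
def score(page_text, keywords, cluster_name):
    """Count how often the search keywords and the cluster name appear on a page."""
    words = page_text.lower().split()
    targets = [k.lower() for k in keywords] + [cluster_name.lower()]
    return sum(words.count(t) for t in targets)

def rerank(clusters, keywords, clicks):
    """Rank pages inside each cluster, then put the most-clicked clusters first."""
    ranked = []
    for name, pages in clusters.items():
        urls = sorted(pages, key=lambda u: (-score(pages[u], keywords, name), u))
        ranked.append((name, urls))
    # Stable sort: clusters with equal click counts keep their original order.
    ranked.sort(key=lambda item: -clicks.get(item[0], 0))
    return ranked

# Hypothetical clusters for the query "stamp collecting" (all data is invented).
clusters = {
    "philatelic": {
        "example.org/society": "philatelic society for stamp collecting and philatelic history",
        "example.org/shop": "stamp shop",
    },
    "rubber": {
        "example.org/craft": "rubber stamp craft ideas",
    },
}
keywords = ["stamp", "collecting"]

before = rerank(clusters, keywords, clicks={})
after = rerank(clusters, keywords, clicks={"rubber": 3})
```

With no click history, the clusters appear in their original order. Once the hypothetical user has clicked on the "rubber" cluster three times, that cluster moves ahead of "philatelic", reflecting the apparent pattern of interest.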