A Starburst of Ideas
I’m lunching with Capper on a brilliant early-winter day in San Francisco. She’s in town to call on potential investors and customers. “People who control the flow of information have a subtle but pervasive power,” she tells me earnestly. “Someone has to hold that power, and it is important that the people who do are those who consciously try to have a positive impact, and who give power back to the individual.” Mooter aims to do that by making Web searches easier and more personal. Capper grew up in Zambia, studied psychology in South Africa, and founded a chain of early-childhood-development centers before emigrating to Australia in 1997 and choosing search technology as the place to make her next impact. She set up shop in downtown Sydney and hired Jondarr Gibb, an experienced software architect, and John Zakos, a graduate student writing his Griffith University doctoral thesis on the applications of neural-network theory to Internet searches.
The three have mixed their ideas on psychology, software, and neural networks to create a ranking algorithm that learns from the user as a search progresses. Before dumping a long list of links on a user, Mooter analyzes the potential meanings and permutations of the starting keywords and, behind the scenes, ranks the relevance of the resulting Web pages within broad categories called clusters. The user first sees an on-screen “starburst” of cluster names. A search on the name Paul Cézanne, for example, yields clusters such as art, artists, Cézanne, France, galleries, and famous paintings. That’s the psychology part. “When you do a traditional search, you get your millions of results, and your mind does a conceptual grouping,” says Capper. “But our minds are hard-wired to process only three to five kinds of information at once. We decided not to override that but to work with it.”
Then comes the learning part. To develop a more precise understanding of what the user is probably looking for, the Mooter engine notes which clusters and links get clicked and uses that information to improve future responses. Suppose a user enters the term “dog,” clicks on a cluster called “breeds,” and then spends a lot of time looking at sites about Schnoodles (a popular Schnauzer-Poodle mix). When the user clicks on a new search result, Mooter will personalize the ranking to reflect this apparent pattern of interest, which might, for example, lead to sites about “dogs” plus “breeds” plus “Schnoodles” appearing higher. A refined set of results appears on every page; the engine continues to adjust the rankings based on the user’s behavior.
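Mooter's actual algorithm is not public, but the general idea of click-driven re-ranking can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `rerank` function, the additive `boost` weight, the per-page term sets, and the example.org URLs are all hypothetical, and a simple term counter stands in for whatever neural-network model the company uses.

```python
from collections import Counter

def rerank(results, clicked_terms, boost=1.0):
    """Re-order results using terms the user has already shown interest in.

    results: list of (url, base_score, terms) tuples, where terms is a set.
    clicked_terms: terms taken from clusters and links the user clicked.
    boost: hypothetical weight given to each matching click.
    """
    interest = Counter(clicked_terms)

    def personalized(item):
        _url, base_score, terms = item
        # Add a bonus for every past click that matches one of the page's terms.
        return base_score + boost * sum(interest[t] for t in terms)

    return sorted(results, key=personalized, reverse=True)

# Hypothetical results for the query "dog", with made-up base scores.
results = [
    ("example.org/dog-food", 2.5, {"dog", "food"}),
    ("example.org/schnoodle-care", 1.5, {"dog", "breeds", "schnoodle"}),
    ("example.org/dog-breeds", 2.0, {"dog", "breeds"}),
]

# The user clicked the "breeds" cluster, then two Schnoodle pages.
clicks = ["breeds", "schnoodle", "schnoodle"]
ranked = rerank(results, clicks)
for url, _score, _terms in ranked:
    print(url)
```

With these made-up numbers the Schnoodle page moves from last place to first, which is the kind of shift the article describes; a real engine would also have to decide how quickly old clicks should stop counting.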
The whole idea is to give people the results they want in as few clicks as possible. “Two clicks and we already have a very good idea of where you’re heading,” says Capper. When Mooter’s beta site debuted last October, Capper didn’t expect it to be noticed outside Australia. But traffic from around the world has been so heavy, she says, that the company has had to install more Web servers to keep the service running.
Spend much time talking to search-industry insiders and you’ll realize that there are almost as many ways to rank search results as there are pages on the Web. Google’s supposed overreliance on popularity was one of the inspirations behind Teoma (pronounced tay-o-ma), founded in 2000 by computer scientist Apostolos Gerasoulis and colleagues at Rutgers University in New Jersey. Teoma’s search software now powers Ask Jeeves, the number six search site. Google “looks at the structure of the Web, but that method doesn’t go down to the next level,” says Paul Gardi, Teoma’s senior vice president for search. “When you get down to the local level, you will find that links cluster around certain subjects or themes, very much like communities.” For instance, pages on “home improvement” don’t simply link upward to more popular pages; they also tend to link to each other, forming circles around prominent sites like Hometime.com, Homeideas.com, and BobVila.com.
The Rutgers scientists designed Teoma (Gaelic for “expert”) to find those subject-specific communities and exploit their wisdom. Before the Teoma engine presents the results for a given set of keywords, Gardi explains, it identifies the associated communities and looks for the “authorities” within them, that is, the pages that community members’ Web sites point to most often. Teoma tries to verify the credibility of these authority pages by checking whether they’re listed on resource pages created by subject experts or enthusiasts, which tend to link to the best pages within the community. It then ranks search results according to how often each page is cited by authority pages.
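The three steps Gardi describes (find the most-linked pages in a community, check them against resource pages, rank results by citations from the survivors) can be sketched on a toy link graph. This is not Teoma's proprietary technique; the function names, the `top_n` cutoff, and the page names are hypothetical, and the community is assumed to be known in advance rather than discovered.

```python
from collections import Counter

def find_authorities(links, community, resource_pages, top_n=2):
    """Return the most-linked community pages that a resource page also lists."""
    # Step 1: count links that start and end inside the community.
    inlinks = Counter(
        dst
        for src in community
        for dst in links.get(src, ())
        if dst in community and dst != src
    )
    # Step 2: keep only candidates listed on at least one resource page.
    listed = {dst for page in resource_pages for dst in links.get(page, ())}
    candidates = [page for page, _ in inlinks.most_common() if page in listed]
    return candidates[:top_n]

def rank_by_authority_citations(links, results, authorities):
    """Step 3: order results by how many authority pages link to each one."""
    cites = {
        page: sum(page in links.get(auth, ()) for auth in authorities)
        for page in results
    }
    return sorted(results, key=lambda page: cites[page], reverse=True)

# Hypothetical home-improvement community: page -> pages it links to.
links = {
    "fan1": ["authA", "authB"],
    "fan2": ["authA", "authB"],
    "fan3": ["authA", "fan1"],
    "guide": ["authA", "authB"],   # an enthusiast's resource page
    "authA": ["decks", "paint"],
    "authB": ["decks"],
}
community = {"fan1", "fan2", "fan3", "guide", "authA", "authB"}

authorities = find_authorities(links, community, resource_pages=["guide"])
ranked = rank_by_authority_citations(links, ["paint", "decks", "tiles"], authorities)
print(authorities)
print(ranked)
```

Here `fan1` receives a link from inside the community but is dropped because no resource page lists it, which is the role the article assigns to the credibility check.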
IBM and other organizations experimented with similar authority-based ranking systems in the late 1990s, but Gerasoulis says their approaches could take hours to slog through all the pages out there. Gerasoulis’s proprietary technique does the same thing in about a fifth of a second. Ask Jeeves dumped its previous search provider and switched to Teoma’s technology in 2001, and its query volumes jumped 30 percent per year in 2002 and 2003.
Hard as it may be to believe when you’re looking at a dozen pages of search results, today’s search engines ignore most of what is out there on the Internet. Software spiders have difficulty indexing content that is protected behind sign-up forms or stored in databases such as product catalogues or legal and medical archives and only assembled into Web pages at the moment users request it. This so-called Deep Web may amount to as much as 92 petabytes (92 million gigabytes) worldwide, or nearly 500 times the volume of the surface Web, according to the School of Information Management and Systems at the University of California, Berkeley.
Mining the Deep Web is the mission of another fresh face in the search business: Chicago-based Dipsie. “Google and Teoma only index about 1 percent of the documents out there,” says Jason Wiener, Dipsie’s founder and chief technology officer. Wiener, a self-taught programmer who ran a San Francisco Web development company until the dot-com crash, has spent the last two years building a more nimble crawler, one that can get past forms and database interfaces. Say you’re wondering about the standard equipment on a Mercedes 55SL convertible. At Cars.com, drilling down to the page with detailed product information will take about six steps. Dipsie, however, will have indexed the entire Cars.com database in advance, so it can send you to the same page with a single click. “We don’t handle anything that requires authentication with a username and password, but we do almost everything else,” Wiener says. He claims that by the time Dipsie’s search site becomes publicly available this summer, its index will include 10 billion documents, triple the current size of Google’s index.
So while Google is still king of the hill, the hill itself is now crawling with competitors with their own bright ideas. “Google knows this,” says Gartner analyst Andrews. “They were born at Stanford, and they know there are students in Stanford’s classes who are saying, ‘Hey, I’ve got an idea: what if we take this algorithm and stitch it together with that algorithm?’ They’ve got to either hire the young turks or defeat them.”