“It’s a valid approach,” says Bruce Maggs, a professor of computer science at Duke University in Durham, NC, and vice president of research at Akamai, a Web content delivery and caching company based in Cambridge, MA. Fully replicating a database at multiple sites, as search companies typically do now, is inefficient, Maggs says, since only a small proportion of data is accessed at each site. A distributed approach “also saves considerably on everything else in the same proportion, such as capital costs and real estate,” he says. This is because, overall, the number of servers required goes down.
For users, the advantage would be quicker search results, because most answers would come from a data center that’s geographically closer. A small number of results would take longer than normal, but only 20 to 30 percent longer, says Baeza-Yates. “On average, most queries will be faster,” he says.
Maggs says the performance improvement would need to be large enough to offset any delay in the queries that have to be sent farther afield.
Another trade-off is that more users would get different results, depending on where they were, than is currently the case, says Peter Triantafillou, a researcher at the University of Patras in Greece who studies large-scale search. This already happens to some extent even under a centralized model, he says, but it could be a bigger concern if many more searches were inconsistent.
However, with search engine data centers already housing tens of thousands of servers, it’s questionable whether they can continue to grow and still function efficiently, Triantafillou says. “Will they be able to go to hundreds of thousands or millions?” he asks. Just installing the cabling and optics in and out of such facilities would pose serious practical problems, he adds.
The distributed approach remains a long-term aim, Baeza-Yates admits. “But for the Internet,” he adds, “long-term is only about five years.”