Splitting Up Search

  • Friday, November 6, 2009
  • By Duncan Graham-Rowe

"It's a valid approach," says Bruce Maggs, a professor of computer science at Duke University in Durham, NC, and vice president of research at Akamai, a Web content delivery and caching company based in Cambridge, MA. Fully replicating a database at multiple sites, as search companies typically do now, is inefficient, Maggs says, since only a small proportion of data is accessed at each site. A distributed approach "also saves considerably on everything else in the same proportion, such as capital costs and real estate," he says. This is because, overall, the number of servers required goes down.

For users, the advantage would be quicker search results, because most answers would come from a data center that's geographically closer. A small number of results would take longer than normal, but only 20 to 30 percent longer, says Baeza-Yates. "On average, most queries will be faster," he says.

Maggs says the performance improvement would need to be high enough to counteract any delay in those search queries that have to be sent further afield.
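Maggs's break-even condition can be made concrete with a little arithmetic. The sketch below is illustrative only: the baseline and local latencies are hypothetical numbers chosen for the example, and only the 30 percent penalty for remote queries comes from the figures quoted above. It estimates what share of queries would have to be answered locally before the distributed design matches today's average response time.

```python
def mean_latency(local_fraction, local_latency, remote_latency):
    """Expected latency when a given fraction of queries is answered locally."""
    return local_fraction * local_latency + (1.0 - local_fraction) * remote_latency


def break_even_local_fraction(baseline, local_latency, remote_latency):
    """Local fraction at which the distributed design equals the baseline latency."""
    return (remote_latency - baseline) / (remote_latency - local_latency)


# Hypothetical values for illustration; not measurements from the article.
BASELINE_MS = 100.0             # assumed latency of the fully replicated design
LOCAL_MS = 80.0                 # assumed latency when a nearby data center answers
REMOTE_MS = BASELINE_MS * 1.30  # upper end of the quoted 20 to 30 percent penalty

threshold = break_even_local_fraction(BASELINE_MS, LOCAL_MS, REMOTE_MS)
print(f"break-even local fraction: {threshold:.2f}")
print(f"mean latency at 90% local: {mean_latency(0.9, LOCAL_MS, REMOTE_MS):.1f} ms")
```

Under these assumed numbers, the distributed design comes out ahead on average once more than about three in five queries are served by the nearby data center. A smaller speedup for local queries would push that threshold higher.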

Another trade-off is that more users would get different results, depending on where they were, than is currently the case, says Peter Triantafillou, a researcher at the University of Patras in Greece who studies large-scale search. This already happens to some extent even under a centralized model, he says, but it could be a bigger concern if many more searches were inconsistent.

However, with search engine data centers already housing tens of thousands of servers, it's questionable whether they can continue to grow and still function efficiently, Triantafillou says. "Will they be able to go to hundreds of thousands or millions?" he says. Just the practicality of installing the cabling and optics in and out of such facilities would pose serious problems, he says.

The distributed approach remains a long-term aim, Baeza-Yates admits. "But for the Internet," he adds, "long-term is only about five years."

bagapiev

  • 11/06/2009

Distributed Index

Great article; all these points are very true. For disclosure, I am the founder of Wowd, a distributed search & discovery startup that uses a distributed cloud (or p2p) approach to completely distribute the index across user desktops.

I was definitely puzzled when I saw in the article that "this (p2p) approach hasn't proven very scalable". On the contrary, our approach is all about scalability, and the scale of our system is limited only by the number of users, something that cannot be said about distributed data centers (one still needs many of them).

In fact, it is quite natural to ask why the idea of distribution, which they rightly point out, should stop at the boundary of data centers. One can distribute the index across user desktops, with many benefits: great geographic distribution, since, on average, there will be users very close by to serve answers; massive replication; and natural proximity of users and their attention data to the system resources (on the edge of the network). Our system is in very early stages, but we are already seeing benefits of this massive diversity of geo-distribution.

One need look no further than BitTorrent to see the performance benefits of massively distributed systems. Of course, the key point is that BitTorrent is a read-only CDN, but we are addressing that point with our DHTFS (DHT-based file system), which allows writes.

In summary, the article makes great points, but they can be naturally extended much further than just distributed data centers, which is something we are in the process of doing :)
