Technology Review

Splitting Up Search

Distributing a search engine's index around the world could make it faster and more efficient, researchers say.

  • Friday, November 6, 2009
  • By Duncan Graham-Rowe

Searching the Web could become faster for users and much more efficient for search companies if search engines were split up and distributed around the world, according to researchers at Yahoo.

Currently, search engines are based on a centralized model, explains Ricardo Baeza-Yates, a researcher at Yahoo's Labs in Barcelona, Spain. This means that a search engine's index--the core database that lists the location and relative importance of information stored across the Web--is replicated, along with additional data such as cached copies of content, within several data centers at different locations. The tendency among search companies, says Baeza-Yates, has been to operate a relatively small number of very large data centers across the globe.

Baeza-Yates and his colleagues devised another way: a "distributed" approach, with both the search index and the additional data spread out over a larger number of smaller data centers. With this approach, smaller data centers would contain locally relevant information and a small proportion of globally replicated data. Many search queries common to a particular area could be answered using the content stored in a local data center, while other queries would be passed on to different data centers.

"Many people have talked about this in the past," says Baeza-Yates. But there was resistance, he says, because many assumed that such an approach would be too slow or expensive. It was also unclear how to ensure that each query got the best global result and not just the best that the local center had to offer. A few start-up companies have even launched peer-to-peer search engines that harness the power of users' own machines. But this approach hasn't proven very scalable.

To achieve a workable distributed system, Baeza-Yates and colleagues designed it so that statistical information about page rankings could be shared between the different data centers. This would allow each data center to run an algorithm that compares its results with those of others. If another data center gave a statistically better result, the query would be forwarded to it.
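The article does not spell out the algorithm, but the forwarding decision might look roughly like the Python sketch below. Everything in it is an illustrative assumption rather than the Yahoo team's actual method: the DataCenter class, the additive scoring, the sample pages, and the choice of "best score per term" as the statistic that centers share. Each center adds up the shared per-term maxima to get an upper bound on what a remote center could return, and forwards the query only when that bound beats its own best local result.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataCenter:
    name: str
    # Hypothetical local index: term -> {document id: relevance score}.
    index: dict[str, dict[str, float]]
    # Shared statistics: remote center name -> {term: best score held there}.
    remote_best: dict[str, dict[str, float]] = field(default_factory=dict)

    def term_maxima(self) -> dict[str, float]:
        """Summary this center would publish to its peers (an assumed format)."""
        return {term: max(docs.values()) for term, docs in self.index.items() if docs}

    def search(self, query: str) -> tuple[str, float] | None:
        """Return the best local (document, score), or None if nothing matches."""
        totals: dict[str, float] = {}
        for term in query.lower().split():
            for doc, score in self.index.get(term, {}).items():
                totals[doc] = totals.get(doc, 0.0) + score
        if not totals:
            return None
        best = max(totals, key=totals.get)
        return best, totals[best]

    def centers_to_ask(self, query: str) -> list[str]:
        """Names of remote centers that could beat the best local answer."""
        local = self.search(query)
        local_score = local[1] if local else 0.0
        terms = query.lower().split()
        targets = []
        for name, maxima in self.remote_best.items():
            # Sum of per-term maxima bounds any remote document's score from above.
            bound = sum(maxima.get(term, 0.0) for term in terms)
            if bound > local_score:
                targets.append(name)
        return targets


# Made-up sample data for two centers.
barcelona = DataCenter(
    "barcelona",
    {"tapas": {"bcn/tapas-guide": 0.9}, "weather": {"bcn/forecast": 0.4}},
)
hong_kong = DataCenter(
    "hong_kong",
    {"weather": {"hk/typhoon-watch": 0.8}, "dim": {"hk/dim-sum": 0.7}},
)
barcelona.remote_best["hong_kong"] = hong_kong.term_maxima()

local_only = barcelona.centers_to_ask("tapas")    # answered locally
forwarded = barcelona.centers_to_ask("weather")   # Hong Kong might do better
```

In this toy setup, a query for "tapas" stays in Barcelona because no remote center can score above the local result, while a query for "weather" is forwarded to Hong Kong, whose shared statistics promise a higher score than the best local page.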

The group put the distributed approach to the test in a feasibility study, using real search data. They present their findings this week at the Association for Computing Machinery's Conference on Information and Knowledge Management in Hong Kong, where they will receive the award for the best paper.

"We wanted to prove that we could achieve the same performance [as the centralized model] without it costing too much," says Baeza-Yates. In fact, they found that their approach could reduce the overall costs of operating a search engine by as much as 15 percent without compromising the quality of the answers.

bagapiev

2 Comments

  • 831 Days Ago
  • 11/06/2009

Distributed Index

Great article; all these points are very true. For disclosure, I am the founder of Wowd, a distributed search & discovery startup that uses a distributed cloud (or p2p) approach to completely distribute the index across user desktops.

I was definitely puzzled when I saw in the article that "this (p2p) approach hasn't proven very scalable". On the contrary, our approach is all about scalability, and the scale of our system is limited only by the number of users, something that cannot be said of distributed data centers (one still needs many of them).

In fact, it is quite natural to ask why the idea of distribution, which they rightly point out, should stop at the boundary of data centers. One can distribute it on user desktops, with many benefits: great geographic distribution since, on average, there will be users very close by to serve answers; there is also massive replication as well as natural proximity of users and their attention data to the system resources (on the edge of the network). Our system is in its very early stages, but we are already seeing the benefits of this massive diversity of geo-distribution.

One needs to go no further than BitTorrent to see the benefits of massively distributed systems in terms of performance. Of course, the key point is that BitTorrent is a read-only CDN, but we are addressing that point with our DHTFS (DHT-based file system), which allows writes.

In summary, the article makes great points, but they can be naturally extended much further than just distributed data centers, which is something we are in the process of doing :)
