Technology Review - Published By MIT
Splitting Up Search

Distributing a search engine's index around the world could make it faster and more efficient, researchers say.

By Duncan Graham-Rowe

Friday, November 06, 2009

Searching the Web could become faster for users and much more efficient for search companies if search engines were split up and distributed around the world, according to researchers at Yahoo.

Currently, search engines are based on a centralized model, explains Ricardo Baeza-Yates, a researcher at Yahoo's Labs in Barcelona, Spain. This means that a search engine's index (the core database that lists the location and relative importance of information stored across the Web) and additional data, such as cached copies of content, are replicated within several data centers at different locations. The tendency among search companies, says Baeza-Yates, has been to operate a relatively small number of very large data centers across the globe.

Baeza-Yates and his colleagues devised another way: a "distributed" approach, with both the search index and the additional data spread out over a larger number of smaller data centers. With this approach, smaller data centers would contain locally relevant information and a small proportion of globally replicated data. Many search queries common to a particular area could be answered using the content stored in a local data center, while other queries would be passed on to different data centers.
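The local-first idea described above can be illustrated with a minimal sketch. It is not the researchers' implementation: the `DataCenter` class, the `answer` function, and the toy term-to-document indexes are all hypothetical names introduced here, and the sketch assumes a query is simply passed to the next data center whenever the home one has no hits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataCenter:
    """Hypothetical small data center: local content plus a replicated slice."""
    name: str
    local_index: Dict[str, List[str]] = field(default_factory=dict)
    replicated_index: Dict[str, List[str]] = field(default_factory=dict)

    def lookup(self, term: str) -> List[str]:
        # Locally relevant documents first, then the globally replicated slice.
        return self.local_index.get(term, []) + self.replicated_index.get(term, [])


def answer(term: str, home: DataCenter, others: List[DataCenter]) -> Tuple[str, List[str]]:
    """Answer from the home data center if it has hits; otherwise pass the query on."""
    hits = home.lookup(term)
    if hits:
        return home.name, hits
    for dc in others:
        hits = dc.lookup(term)
        if hits:
            return dc.name, hits
    # No data center had a match; report an empty result from the home site.
    return home.name, []


# Toy data (assumed for illustration only).
shared = {"weather": ["global/weather"]}
barcelona = DataCenter("barcelona", {"tapas": ["es/tapas-bars"]}, shared)
hong_kong = DataCenter("hong_kong", {"dim sum": ["hk/dim-sum"]}, shared)

print(answer("tapas", barcelona, [hong_kong]))    # served by the home site
print(answer("dim sum", barcelona, [hong_kong]))  # passed on to another site
```

In this toy setup, a query that is common to the user's region never leaves the home data center, which is the source of the savings the researchers describe.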

"Many people have talked about this in the past," says Baeza-Yates. But there was resistance, he says, because many assumed that such an approach would be too slow or expensive. It was also unclear how to ensure that each query got the best global result and not just the best that the local center had to offer. A few start-up companies have even launched peer-to-peer search engines that harness the power of users' own machines. But this approach hasn't proven very scalable.

To achieve a workable distributed system, Baeza-Yates and colleagues designed it so that statistical information about page rankings could be shared between the different data centers. This would allow each data center to run an algorithm that compares its results with those of others. If another data center gave a statistically better result, the query would be forwarded to it.
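A rough sketch of such a forwarding decision follows. The article does not say which statistics are shared or how the comparison is made, so everything here is an assumption: the sketch supposes each data center publishes, per query term, the best ranking score any of its documents can reach, and the names `SharedStats`, `choose_forward_target`, and `margin` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical shared statistics: for each data center, the best ranking
# score any of its documents can reach for a given term.
SharedStats = Dict[str, Dict[str, float]]


def choose_forward_target(
    term: str,
    local_name: str,
    local_scores: List[float],
    shared_stats: SharedStats,
    margin: float = 0.0,
) -> Optional[str]:
    """Return the data center to forward the query to, or None to answer locally."""
    local_best = max(local_scores, default=0.0)
    # A remote site must beat the local best by more than `margin` to win.
    best_name: Optional[str] = None
    best_score = local_best + margin
    for name, term_scores in shared_stats.items():
        if name == local_name:
            continue
        remote_best = term_scores.get(term, 0.0)
        if remote_best > best_score:
            best_name, best_score = name, remote_best
    return best_name


# Toy statistics (assumed for illustration only).
stats: SharedStats = {
    "barcelona": {"tapas": 0.9, "dim sum": 0.2},
    "hong_kong": {"tapas": 0.3, "dim sum": 0.95},
}

print(choose_forward_target("tapas", "barcelona", [0.9, 0.4], stats))  # None
print(choose_forward_target("dim sum", "barcelona", [0.2], stats))     # hong_kong
```

The `margin` parameter is one way to avoid forwarding when the remote result would be only marginally better; whether the actual system uses anything like it is not stated in the article.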

The group put the distributed approach to the test in a feasibility study, using real search data. They present their findings this week at the Association for Computing Machinery's Conference on Information and Knowledge Management in Hong Kong, where they will receive the award for the best paper.

"We wanted to prove that we could achieve the same performance [as the centralized model] without it costing too much," says Baeza-Yates. In fact, they found that their approach could reduce the overall costs of operating a search engine by as much as 15 percent without compromising the quality of the answers.

Comments

  • Distributed Index
    Great article; all these points are very true. For disclosure, I am the founder of Wowd, a distributed search & discovery startup that uses a distributed cloud (or p2p) approach to completely distribute the index across user desktops.

    I was definitely puzzled when I saw in the article that "this (p2p) approach hasn't proven very scalable". On the contrary, our approach is all about scalability, and the scale of our system is limited only by the number of users, something that cannot be said about distributed data centers (one still needs many of them).

    In fact, it is quite natural to ask why the idea of distribution, which they rightly point out, should stop at the boundary of data centers. One can extend it to user desktops, with many benefits: great geographic distribution, since on average there will be users very close by to serve answers; massive replication; and natural proximity of users and their attention data to the system resources (on the edge of the network). Our system is in its very early stages, but we are already seeing benefits of this massive diversity of geo-distribution.

    One needs to go no further than BitTorrent to see the benefits of massively distributed systems in terms of performance. Of course, the key point is that BitTorrent is a read-only CDN, but we are addressing that point with our DHTFS (DHT-based file system), which allows writes.

    In summary, the article makes great points, but they can be naturally extended much further than just distributed data centers, which is something we are in the process of doing :)

    bagapiev
    11/06/2009

MIT Massachusetts Institute of Technology © 2009 Technology Review. All Rights Reserved.