A Web Spider for Everyone

A startup uses PC idle time to crawl Web pages on demand.

Erica Naone

September 25, 2009

As the quantity of information on the Internet continues to grow, so does the challenge of processing it all and making it useful. A startup called 80legs, based in Houston, TX, is hoping that an inexpensive, distributed Web crawling service could help other startups mine the Web for information without having to build the giant server farms used by major search engines. The company launched this week at DEMO, a conference in San Diego that showcases new companies.

Web crawlers, or spiders, are programs that automatically visit pages on the Internet; they can be used to index those pages and to gather bits of information from them. Search engines, for example, use crawlers to keep track of where information is located on the Web. But the scale of the Web means that comprehensive crawling consumes a lot of processing power, which typically means building huge data centers to run the software.
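For readers who want to see the basic idea, here is a minimal sketch of a crawler in Python. It is not 80legs's code: the names `LinkExtractor`, `crawl`, and `FAKE_WEB` are invented for illustration, and the "Web" is a three-page dictionary held in memory so the example runs without a network connection. The sketch visits a page, pulls out its links, and queues any it has not seen before.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: [outgoing links]}.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that returns a page's HTML as a string, or None if it cannot be loaded.
    """
    seen = {start_url}
    queue = deque([start_url])
    index = {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the page they were found on.
        links = [urljoin(url, href) for href in parser.links]
        index[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Toy in-memory "Web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

index = crawl("http://example.test/", FAKE_WEB.get)
```

A real crawler would replace `FAKE_WEB.get` with a function that downloads pages, and would also need politeness rules, error handling, and storage, which is where the processing cost described above comes from.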

80legs hopes to make this technology more accessible to small companies and individuals by leasing access to its crawlers and letting customers pay only for what they crawl.

Web crawling technology is also crucial for semantic sites and services designed to process natural-language queries. While 80legs expects to see users interested in search and semantic applications, CEO Shion Deysarkar says that testers of the service have also included customers with less technical interests. Some market researchers, for example, use 80legs to uncover mentions of specific companies or topics across the Web.

A user can start a Web crawl through 80legs’s Web-based interface. The form on the company’s site lets them set parameters for the project and upload custom code needed to control how the crawler does its job. For example, a user might want the crawler to find images and check them against a database of copyrighted ones. Deysarkar says his company’s crawlers are capable of processing up to two billion pages a day. The company charges $2 for every million pages crawled, plus a fee of three cents per hour of processing used.
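As a rough illustration of that pricing, the arithmetic can be sketched in a few lines of Python. The function name `estimate_crawl_cost` is hypothetical, and the sketch assumes the two fees are simply added together, with no minimums, discounts, or rounding rules, none of which are described here.

```python
from decimal import Decimal

# Rates as quoted in the article.
PRICE_PER_MILLION_PAGES = Decimal("2")    # dollars per million pages crawled
PRICE_PER_CPU_HOUR = Decimal("0.03")      # dollars per hour of processing


def estimate_crawl_cost(pages, cpu_hours):
    """Return the estimated cost in dollars (assumes the fees simply add)."""
    page_fee = PRICE_PER_MILLION_PAGES * Decimal(pages) / Decimal(1_000_000)
    processing_fee = PRICE_PER_CPU_HOUR * Decimal(cpu_hours)
    return page_fee + processing_fee


cost = estimate_crawl_cost(10_000_000, 100)
print(f"Estimated cost: ${cost:.2f}")
```

Under these assumptions, a crawl of 10 million pages that uses 100 hours of processing would cost $20 in page fees plus $3 in processing fees, or $23 in total.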

Many startups struggle to find the funding needed to build large data centers, but that’s not the approach 80legs took to construct its Web crawling infrastructure. The company instead runs its software on a distributed network of personal computers, much like the ones used for projects such as SETI@home. The distributed computing network is put together by Plura Processing, which rents it to 80legs. Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and other rewards.

Deysarkar says the approach significantly reduces costs for 80legs, allowing the company to offer its service for far less than would be possible if it used a data center, or even a cloud-computing service such as Amazon Web Services.

Daniel Tunkelang, cofounder of the search company Endeca, based in Cambridge, MA, says that a good Web crawling service could be useful for startups that want to focus on building the search experience rather than on collecting the data. But Tunkelang says the success of 80legs may depend on how easy it is for users to customize the crawl. “The big question is, how adaptive and programmable is the crawl?” he says.

Tunkelang also notes that it’s important for a Web crawler to capture as much information as possible. For example, the path a crawler took to arrive at a particular page can provide a search company with useful information about the contents of that page.

A service such as 80legs could also be useful for university researchers. “Crawling at large scale is indeed an expensive hurdle to cross for experimental search projects in academia, which often are lacking large-scale infrastructure,” says Kevin Chang, an associate professor of computer science at the University of Illinois at Urbana-Champaign.

Chang thinks the distributed nature of 80legs is “an interesting direction and sounds promising [for lowering] the cost of crawling.” At the same time, he agrees that a lot depends on how efficiently the system operates and how effectively users can customize what data they want to process.

80legs plans to launch a market where nontechnical users will be able to purchase applications that can control how a crawler functions. Partner companies will also be able to sell access to applications that control 80legs’s crawlers.
