Technology Review - Published By MIT

A Web Spider for Everyone

A startup uses PC idle time to crawl Web pages on demand.

By Erica Naone

Friday, September 25, 2009


As the quantity of information on the Internet continues to grow, so does the challenge of processing it all and making it useful. A startup called 80legs, based in Houston, TX, is hoping that an inexpensive, distributed Web crawling service could help startups mine the Web for information without having to build the giant server farms used by major search engines. The company launched this week at DEMO, a conference in San Diego that showcases new companies.


Web crawlers, or spiders, are programs that automatically visit pages on the Internet; they can be used to index those pages and to gather bits of information from them. Crawlers are used by search engines, for example, to monitor the location of information on the Web. But the scale of the Web means that comprehensive crawling consumes a lot of processing power, which typically means building huge data centers to run the software.
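To make the idea concrete, here is a minimal sketch of a breadth-first crawler in Python. It is an illustration only, not 80legs's software: the `crawl` function, the `LinkExtractor` class, and the three-page in-memory `PAGES` dictionary are all hypothetical, and a real crawler would fetch pages over the network, resolve relative links, and respect each site's robots file.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(fetch, start_url, max_pages=100):
    """Breadth-first crawl. Returns {url: [outgoing links]} for visited pages.

    `fetch` is any callable that maps a URL to its HTML, or None if missing.
    """
    index = {}
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is visited twice
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        index[url] = parser.links
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical three-page "web" held in memory (assumption for the example).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a>'
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

index = crawl(PAGES.get, "http://example.com/")
```

The `seen` set is what keeps the crawl from looping forever on pages that link back to each other. At Web scale, the queue and the set are what grow large enough to demand serious computing resources.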

80legs hopes to make this technology more accessible to small companies and individuals by leasing access to its crawlers and letting customers pay only for what they crawl.

Web crawling technology is also crucial for semantic sites and services designed to process natural-language queries. While 80legs expects to see users interested in search and semantic applications, CEO Shion Deysarkar says that those who tested the service also included customers with less technical interests. Some market researchers, for example, use 80legs to uncover mentions of specific companies or topics across the Web.


A user can start a Web crawl through 80legs's Web-based interface. The form on the company's site lets them set parameters for the project and upload custom code needed to control how the crawler does its job. For example, a user might want the crawler to find images and check them against a database of copyrighted ones. Deysarkar says his company's crawlers are capable of processing up to two billion pages a day. The company charges $2 for every million pages crawled, plus a fee of three cents per hour of processing used.
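The pricing quoted above can be turned into a quick estimate. The sketch below assumes the two charges simply add together and that the hourly fee applies to each hour of processing consumed; the function name `crawl_cost` and the sample job size are made up for illustration.

```python
PRICE_PER_MILLION_PAGES = 2.00   # dollars, as quoted in the article
PRICE_PER_PROCESSING_HOUR = 0.03  # dollars, as quoted in the article


def crawl_cost(pages, processing_hours):
    """Estimated cost in dollars of one crawl (assumes the fees are additive)."""
    page_fee = pages / 1_000_000 * PRICE_PER_MILLION_PAGES
    processing_fee = processing_hours * PRICE_PER_PROCESSING_HOUR
    return page_fee + processing_fee


# Hypothetical job: 50 million pages using 1,000 hours of processing.
estimate = crawl_cost(50_000_000, 1_000)
print(f"Estimated cost: ${estimate:.2f}")
```

For that hypothetical job, the page fee is 50 × $2 = $100 and the processing fee is 1,000 × $0.03 = $30, for a total of about $130.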

Many startups struggle to find the funding needed to build large data centers, but that's not the approach 80legs took to construct its Web crawling infrastructure. The company instead runs its software on a distributed network of personal computers, much like the ones used for projects such as SETI@home. The distributed computing network is put together by Plura Processing, which rents it to 80legs. Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and other rewards.
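The article does not say how crawl work is divided among the computers in Plura's network. One common way to split such a job, shown here purely as an assumption, is to hash each URL's host name so that every page from the same site goes to the same machine. The functions `assign_node` and `partition` are hypothetical names for this sketch.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url, num_nodes):
    """Pick a node index for a URL by hashing its host name.

    A cryptographic hash is used only because it gives the same result on
    every machine, unlike Python's built-in hash() for strings.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs into one work list per node."""
    buckets = [[] for _ in range(num_nodes)]
    for url in urls:
        buckets[assign_node(url, num_nodes)].append(url)
    return buckets


# Hypothetical URLs and node count, for illustration only.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
    "http://example.net/news",
]
buckets = partition(urls, 4)
```

Keeping a site's pages on a single node makes it easier to limit how often that site is hit, which speaks to the concern about bots pounding away at servers.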

Comments

  • Great
    Another Majestik pounding away at websites? When a bot hits my sites, I find it a treasure trove of IP address ranges for server farms that I can also ban. Real customers and visitors are generally not coming from server farms.

    Having them come from end-user ranges will make banning more than just the single IP address a bit more difficult. Every one of these search engines thinks it has the God-given right to pound away at your website and to ignore your robots file.

    And copy your content for god-knows-what use.

    IncrediBill on WebmasterWorld seems to express my views best with his endless rants against stupid bots, most of which never produce a product end users can see, or which take the info for unspecified hidden commercial use or for harvesting emails for spam.

    erbium
    09/27/2009
    Posts:108
    Avg Rating:
    3/5
  • On-Demand Web Spider
    80legs.com technology seems very interesting.  Another service called BuildaSearch.com offers very similar services but at a fixed cost and without the distributed computers.  Would need to test 80legs.com to determine which on-demand spidering system works better and produces better results.  

    unitedcolors
    09/29/2009
    Posts:1
  • b12
    I think your reviews are extremely well done and your choice of language makes them even more interesting. An occasional bit of irony or an elegant twist in the phrase is a welcome relief.

    alexiaalline
    10/15/2009
    Posts:3
    Avg Rating:
    2/5

MIT Massachusetts Institute of Technology © 2009 Technology Review. All Rights Reserved.