A Free Database of the Entire Web May Spawn the Next Google

Common Crawl supplies a database of over five billion Web pages in the hope that it will inspire new research or online services.

A freely available copy of billions of Web pages could create competition for established giants such as Google.

Google famously started out as little more than a more efficient algorithm for ranking Web pages. But the company also built its success on crawling the Web—using software that visits every page in order to build up a vast index of online content.

A nonprofit called Common Crawl is now running its own Web crawler and making the resulting giant copy of the Web accessible to anyone. The organization offers up over five billion Web pages, available for free so that researchers and entrepreneurs can try things otherwise possible only for those with access to resources on the scale of Google’s.

“The Web represents, as far as I know, the largest accumulation of knowledge, and there’s so much you can build on top,” says entrepreneur Gilad Elbaz, who founded Common Crawl. “But simply doing the huge amount of work that’s necessary to get at all that information is a large blocker; few organizations … have had the resources to do that.”

New search engines are just one of the things that can be built using an index of the Web, says Elbaz, who points out that Google’s translation software was trained using online text available in multiple languages. “The only way they could do that was by starting with a massive crawl. That’s put them on the way to build the Star Trek translator,” he says. “Having an open, shared corpus of human knowledge is simply a way of democratizing access to information that’s fundamental to innovation.”

Elbaz says he noticed around five years ago that researchers with new ideas about how to use Web data felt compelled to take jobs at Google because it was the only place they could test those ideas. He says Common Crawl’s data will make it easier for novel ideas to gain traction, both in the world of startups and in academic research.

Elbaz is the founder and CEO of big data company Factual, and before that he founded a company that Google bought to form the basis of its ad business for Web pages. Common Crawl also has Google’s director of research, Peter Norvig, and MIT Media Lab director Joi Ito on its advisory board.

Common Crawl has so far indexed more than five billion pages, adding up to 81 terabytes of data, made available through Amazon’s cloud computing service. For about $25 a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl’s director. The Internet Archive, another nonprofit, also compiles a copy of the Web and offers a service called the “Wayback Machine” that can show old versions of a particular page. However, it doesn’t allow anyone to analyze all its data at once in that way.

Common Crawl has already inspired or helped out some new Web startups. TinEye, a “reverse” search engine that finds images similar to one provided by the user, made use of early Common Crawl data to get started. One programmer’s personal project using Common Crawl data to measure how many of the Web’s pages connect to Facebook—some 22 percent, he concluded—led to his securing funding for a startup, Lucky Oyster, based on helping people find useful information in their social data.

Other ideas enabled by the project emerged from a contest run last year that awarded prizes for the best use cases. One of the winners used Wikipedia links in crawl data to build a service capable of defining the meanings of words; another tried to determine public attitudes toward congressional legislation by analyzing the content of online discussions about new laws.

Rich Skrenta, cofounder and CEO of search engine startup Blekko (see “As Google Tinkers With Search, Upstarts Gain Ground”), says Common Crawl’s data fulfills a definite need in the startup community. He says Blekko has been approached by startups with technology needing access to large collections of online data. “That kind of data is now easily available from Common Crawl,” says Skrenta, whose company contributed some of its own data to the project in December 2012. Blekko shared information from its system that categorizes Web pages by content, for example labeling whether they contain pornography or spam.

Ben Zhao, an associate professor at the University of California, Santa Barbara, who uses large collections of Web data for research into activity on social sites (see “Hidden Industry Dupes Social Media Users”), says Common Crawl’s data is likely unique. “Fresh, large-scale crawls are quite rare, and I am not personally aware of places to get large crawl data on the Web,” he says.

However, Zhao notes that some of the most interesting and valuable parts of the Web won’t be well represented in Common Crawl’s data: “Social sites are quite sensitive about their content these days, and many implement anti-crawling mechanisms to limit the speed anyone can access their content.”

To access data from these social sites, researchers must strike up relationships with the companies and rely on whatever they will release, a route less available to startups that may be seen as competition.
