A Free Database of the Entire Web May Spawn the Next Google

Common Crawl supplies a database of over five billion Web pages in the hope that it will inspire new research or online services.

Tom Simonite

January 23, 2013

Google famously started out as little more than a more efficient algorithm for ranking Web pages. But the company also built its success on crawling the Web—using software that visits every page in order to build up a vast index of online content.

A nonprofit called Common Crawl is now using its own Web crawler to build a giant copy of the Web that anyone can access. The organization offers up over five billion Web pages, available for free so that researchers and entrepreneurs can try things otherwise possible only for those with access to resources on the scale of Google’s.

“The Web represents, as far as I know, the largest accumulation of knowledge, and there’s so much you can build on top,” says entrepreneur Gilad Elbaz, who founded Common Crawl. “But simply doing the huge amount of work that’s necessary to get at all that information is a large blocker; few organizations … have had the resources to do that.”

New search engines are just one of the things that can be built using an index of the Web, says Elbaz, who points out that Google’s translation software was trained using online text available in multiple languages. “The only way they could do that was by starting with a massive crawl. That’s put them on the way to build the Star Trek translator,” he says. “Having an open, shared corpus of human knowledge is simply a way of democratizing access to information that’s fundamental to innovation.”

Elbaz says he noticed around five years ago that researchers with new ideas about how to use Web data felt compelled to take jobs at Google because it was the only place they could test those ideas. He says Common Crawl’s data will make it easier for novel ideas to gain traction, both in the world of startups and in academic research.

Elbaz is the founder and CEO of big data company Factual, and before that founded a company bought by Google to be the basis of its ad business for Web pages. Common Crawl also has Google’s director of research, Peter Norvig, and MIT Media Lab director Joi Ito on its advisory board.

Common Crawl has so far indexed more than five billion pages, adding up to 81 terabytes of data, made available through Amazon’s cloud computing service. For about $25 a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl’s director. The Internet Archive, another nonprofit, also compiles a copy of the Web and offers a service called the “Wayback Machine” that can show old versions of a particular page. However, it doesn’t allow anyone to analyze all its data at once in that way.

Common Crawl has already inspired or helped out some new Web startups. TinEye, a “reverse” search engine that finds images similar to one provided by the user, made use of early Common Crawl data to get started. One programmer’s personal project using Common Crawl data to measure how many of the Web’s pages connect to Facebook—some 22 percent, he concluded—led to his securing funding for a startup, Lucky Oyster, based on helping people find useful information in their social data.
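The article does not describe how that Facebook measurement was actually done. As a rough illustration only, the sketch below assumes that a page "connects to Facebook" if any of its href or src attributes points at facebook.com or one of its subdomains, and computes the share of such pages in a small in-memory sample. The function names and the matching rule are hypothetical, and a real analysis would read crawl archives rather than a Python list.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes from one HTML page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


def references_domain(html, domain="facebook.com"):
    """Return True if any collected URL points at `domain` or a subdomain.

    Assumption: this matching rule is only one possible definition of
    "connects to Facebook"; the original project may have used another.
    """
    collector = LinkCollector()
    collector.feed(html)
    for url in collector.urls:
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            continue  # skip malformed URLs instead of failing the whole page
        if host == domain or host.endswith("." + domain):
            return True
    return False


def share_referencing(pages, domain="facebook.com"):
    """Fraction of pages (an iterable of HTML strings) that reference `domain`."""
    pages = list(pages)
    if not pages:
        return 0.0
    hits = sum(references_domain(page, domain) for page in pages)
    return hits / len(pages)


# Hypothetical four-page sample: two pages reference facebook.com, two do not.
sample_pages = [
    '<a href="https://www.facebook.com/example">Like us</a>',
    '<a href="/about">About</a>',
    '<a href="http://notfacebook.com/">Unrelated site</a>',
    '<script src="//facebook.com/sdk.js"></script>',
]
sample_share = share_referencing(sample_pages)
```

Matching on the parsed hostname rather than searching the raw text avoids counting pages that merely mention Facebook, or that link to look-alike domains.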

Other ideas enabled by the project emerged from a contest run last year that awarded prizes for the best use cases. One of the winners used Wikipedia links in crawl data to build a service capable of defining the meanings of words; another tried to determine public attitudes toward congressional legislation by analyzing the content of online discussions about new laws.

Rich Skrenta, cofounder and CEO of search engine startup Blekko (see “As Google Tinkers With Search, Upstarts Gain Ground”), says Common Crawl’s data fulfills a definite need in the startup community. He says Blekko has been approached by startups with technology needing access to large collections of online data. “That kind of data is now easily available from Common Crawl,” says Skrenta, whose company contributed some of its own data to the project in December 2012. Blekko shared information from its system that categorizes Web pages by content, for example labeling whether they contain pornography or spam.

Ben Zhao, an associate professor at the University of California, Santa Barbara, who uses large collections of Web data for research into activity on social sites (see “Hidden Industry Dupes Social Media Users”), says Common Crawl’s data is likely unique. “Fresh, large-scale crawls are quite rare, and I am not personally aware of places to get large crawl data on the Web,” he says.

However, Zhao notes that some of the most interesting and valuable parts of the Web won’t be well represented in Common Crawl’s data: “Social sites are quite sensitive about their content these days, and many implement anti-crawling mechanisms to limit the speed anyone can access their content.”

To access data from those sites, researchers must strike up relationships with the companies and rely on whatever they choose to release, a route less available to startups that may be seen as competition.
