A View from Christopher Mims
You, Too, Can Be the Next Google
Former Google engineer Tom Annau is helping upstart search engine Blekko index the entire Web–a problem that’s getting easier, not harder.
“I was at Google for four years and Google obviously has a few more computers than we do,” says Tom Annau, VP of engineering at Blekko, the search engine that claims to be every bit the equal of existing search giants Bing and Google–minus the spam.
“But the amount of actual, useful, interesting information on the internet is not increasing as fast as Moore’s law,” he adds.
Moore’s law isn’t so much a “law” as a general trend in the microprocessor industry–as much a product of a decades-long R&D effort on the part of companies like Intel as the result of any underlying physical principles. It is usually summarized as the number of transistors you can get for a buck roughly doubling every 18 months to two years. This means smaller, faster chips whose power has grown exponentially since at least the 1960s.
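To make that compounding concrete, here is a rough sketch of the arithmetic. It assumes a clean 18-month doubling period, the figure commonly quoted for Moore’s law; the function name and its default value are illustrative, and real-world progress is lumpier than a smooth curve.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Return how many times capacity multiplies over `years`,
    assuming one doubling every `doubling_months` months.

    The 18-month default is an assumption taken from the commonly
    quoted form of Moore's law, not a measured value.
    """
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # Print the multiplier after a few illustrative time spans.
    for span in (1.5, 3, 15):
        print(f"{span} years -> x{growth_factor(span):g}")
```

Under that assumption, 15 years works out to ten doublings, or roughly a thousandfold increase in what a dollar buys. Anything that grows more slowly than that, such as the stock of useful Web content in Annau’s argument, gets cheaper to process every year.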
In contrast, the growth of the Web may be leveling off. Even so, how could any company, much less a startup, possibly hope to compete with the truly gigantic server and network infrastructure of companies like Google and Microsoft?
“Web search is still an application that pushes the boundaries of current computing devices pretty hard,” says Annau. But Blekko maintains a complete, up-to-the-minute index of the Web with fewer than 1,000 servers–fewer, probably, than would be found in any one of Google’s primary data centers–by paring the problem down to a manageable size in ways that the search giants can’t or won’t.
“We try to avoid crawling spam and other bad content,” says Annau. “I think other engines have a crawl-first, ask-questions-later policy. One efficiency we gain is just to not crawl splogs [spam blogs] and other machine-generated gibberish.”
Nearly all of the machine-generated content on the Web is produced precisely to entrap the search engine spiders that crawl it, and to cram their indexes with ad-laden pages. Avoiding these sites altogether–using spam-detecting algorithms and human curation–saves Blekko enormous amounts of resources.
“There is a certain extent beyond which, if you keep crawling, you’re not going to crawl any more interesting or useful or good stuff–there’s a lot of diminishing returns on a large Web crawl,” says Annau.
Blekko also employs a tactic used by other search engines–a “split crawl” that thoroughly indexes the entire Web, including much of its static content, less frequently than a fast crawl that constantly re-indexes sites that change frequently, such as news and blogs.
“We aggressively crawl lots of high-quality, fast-changing sources,” says Annau. “You can see that when you do a slash-date search–you can see stuff that’s come in just a minute ago.”
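A split crawl of this kind can be sketched as a scheduler that gives each source a recrawl interval based on how often it changes. Everything below is an illustrative assumption rather than Blekko’s actual design: the class names, the one-minute and one-week intervals, and the example URLs are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical recrawl intervals in seconds; the article gives no real values.
FAST_INTERVAL = 60               # fast-changing sources such as news and blogs
SLOW_INTERVAL = 7 * 24 * 3600    # mostly static content


@dataclass(order=True)
class CrawlTask:
    due: float                              # time at which the URL should be fetched
    url: str = field(compare=False)
    interval: int = field(compare=False)    # seconds until the next visit


class SplitCrawlScheduler:
    """Keeps one priority queue ordered by due time for both crawl tiers."""

    def __init__(self):
        self._queue = []

    def add(self, url, fast_changing, now=0.0):
        # Assign the source to the fast or slow tier and make it due right away.
        interval = FAST_INTERVAL if fast_changing else SLOW_INTERVAL
        heapq.heappush(self._queue, CrawlTask(now, url, interval))

    def pop_due(self, now):
        """Return the URLs due at `now`, rescheduling each one interval later."""
        due = []
        while self._queue and self._queue[0].due <= now:
            task = heapq.heappop(self._queue)
            due.append(task.url)
            next_visit = CrawlTask(now + task.interval, task.url, task.interval)
            heapq.heappush(self._queue, next_visit)
        return due


if __name__ == "__main__":
    scheduler = SplitCrawlScheduler()
    scheduler.add("https://news.example.com/", fast_changing=True)
    scheduler.add("https://archive.example.org/", fast_changing=False)
    print(scheduler.pop_due(0))     # both sources are due on the first pass
    print(scheduler.pop_due(60))    # only the fast-changing source is due again
```

The point of the design is that the slow tier still covers everything eventually, while most fetches go to the small set of sources where a minute-old index is actually worth having.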
Annau’s long view on search is that indexing the entire Web–or at least the useful parts–is becoming more tractable, not less. In this calculus, Microsoft and Google and other large search engines advertise the scale of their data centers in part to intimidate potential rivals; to make the barriers to entry in the search business seem higher than they actually are.
“Whether we succeed or fail as a startup, it will be true that every year that goes by individual servers will become more and more powerful, and the ability to crawl and index the useful info on the Web will actually become more and more affordable,” says Annau.
“Every startup is a hypothesis about some efficiency in the market that they perceive,” he adds. For Blekko and other search engine startups, such as Gabriel Weinberg’s one-man operation Duck Duck Go, that hypothesis is that the growth of the sum of human knowledge will be bested, handily, by the rate of increase in computing power.
As Rich Skrenta, CEO of Blekko, likes to point out, the growth of Wikipedia is leveling off. “There just aren’t that many topics in the world–people can add and edit pages but you’re not going to just continually see a doubling in its size,” says Annau.