A Startup Hopes to Help Computers Understand Web Pages

Diffbot aims to make it easier for apps to read Web pages the way humans do.

By Rachel Metz
June 1, 2012

Analyze this: Diffbot’s visual learning technology can identify the different items that make up Web pages, like the one shown here.

No matter what language you speak, when you look at a Web page, you can get a good idea of the purpose of the different elements on it—whether they’re images, videos, text, music, or ads. It’s not so easy for machines to do the same, though.

That’s where Diffbot hopes to make a difference. The startup, based in Palo Alto, California, offers application programming interfaces that make it possible for machines to “read” the various objects that make up Web pages. This could enable a publisher to repurpose the contents of pages for a mobile app, or help a startup build a price-comparison site.

The company’s efforts come at a time when some tech titans are also working to add more structure to the vast amount of data on the Web. Google, for example, recently unveiled the Knowledge Graph, an effort to identify the meaning of search queries and return relevant results, rather than simply matching the text of a query with Web pages that include the same words. But these efforts usually rely on people to help by tagging Web content to infer meaning.

John Davi, Diffbot’s vice president of product, says that at its heart, the company is about taking the visual learning technology that propels self-driving cars forward on a road and applying it to Web pages.

The idea, which CEO and founder Mike Tung hatched several years ago while he was a graduate student at Stanford, has hummed along since last year. That’s when Diffbot rolled out an API capable of analyzing two types of Web pages on the basis of the URL. On article pages, Diffbot can pick out headlines, the text of articles, pictures, and tags; on home pages, it can determine basic layout elements like headlines, pictures, links to articles, and ads. By now, several thousand programmers are using it to analyze over 100 million URLs each month, Tung says.

There are many more types of Web pages out there, though. The company believes there are roughly 18 main types, ranging from product and job pages to photo galleries. With a $2 million round of funding announced Thursday—its first following an earlier round of seed funding—the company plans to get moving on the 16 other types. This will involve determining what makes up pages of these types—photos, prices, and so on—and using that information to build algorithms that can process unfamiliar pages.

While Diffbot offers its API to customers for free, it charges for high levels of usage. Brad Garlinghouse, the CEO of file-sharing site YouSendIt and an investor in and advisor to Diffbot, says that while the company isn’t currently profitable, it could be without too much trouble.

“They’re solving some here-and-now problems that customers are willing to pay for,” says Garlinghouse.

Currently, a number of Diffbot users are media companies, including Garlinghouse’s previous employer, AOL (Diffbot powers the content aggregation behind AOL’s tablet magazine, Editions). As Davi, of Diffbot, points out, media companies often purchase publications whose online content has been created with a different content-management system. Diffbot’s API can ease the process of consolidating content, he says.

As the company makes it possible to analyze pages of additional types, its founders hope to see Diffbot used for things like product price comparison, photo and recipe aggregation, and more. Tung says, “It’s going to be really exciting to see what people build.”
