A Startup Hopes to Help Computers Understand Web Pages

Diffbot aims to make it easier for apps to read Web pages the way humans do.

No matter what language you speak, when you look at a Web page, you can get a good idea of the purpose of the different elements on it—whether they’re images, videos, text, music, or ads. It’s not so easy for machines to do the same, though.

That’s where Diffbot hopes to make a difference. The startup, based in Palo Alto, California, offers application programming interfaces that make it possible for machines to “read” the various objects that make up Web pages. This could enable a publisher to repurpose the contents of pages for a mobile app, or help a startup build a price-comparison site.

The company’s efforts come at a time when some tech titans are also working to add more structure to the vast amount of data on the Web. Google, for example, recently unveiled the Knowledge Graph, an effort to identify the meaning of search queries and return relevant results, rather than simply matching the text of a query with Web pages that include the same words. But these efforts usually rely on people tagging Web content so that machines can infer its meaning.

John Davi, Diffbot’s vice president of product, says that at its heart, the company is about taking the visual learning technology that propels self-driving cars forward on a road and applying it to Web pages.

The idea, which CEO and founder Mike Tung hatched several years ago while he was a graduate student at Stanford, has hummed along since last year. That’s when Diffbot rolled out an API capable of analyzing two types of Web pages on the basis of the URL. On article pages, Diffbot can pick out headlines, the text of articles, pictures, and tags; on home pages, it can determine basic layout elements like headlines, pictures, links to articles, and ads. By now, several thousand programmers are using it to analyze over 100 million URLs each month, Tung says.

There are many more types of Web pages out there, though. The company believes there are roughly 18 main types, ranging from product and job pages to photo galleries. With a $2 million round of funding announced Thursday—its first following an earlier round of seed funding—the company plans to get moving on the 16 other types. This will involve determining what makes up pages of these types—photos, prices, and so on—and using that information to build algorithms that can process unfamiliar pages.

While Diffbot offers its API to customers for free, it charges for high levels of usage. Brad Garlinghouse, the CEO of file-sharing site YouSendIt and an investor in and advisor to Diffbot, says that while the company isn’t currently profitable, it could be without too much trouble.

“They’re solving some here-and-now problems that customers are willing to pay for,” says Garlinghouse.

Currently, a number of Diffbot users are media companies, including Garlinghouse’s previous employer, AOL (Diffbot powers the content aggregation behind AOL’s tablet magazine, Editions). As Davi, of Diffbot, points out, media companies often purchase publications whose online content has been created with a different content-management system. Diffbot’s API can ease the process of consolidating content, he says.

As the company makes it possible to analyze pages of additional types, its founders hope to see Diffbot used for things like product price comparison, photo and recipe aggregation, and more. Tung says, “It’s going to be really exciting to see what people build.”
