
Researchers at the University of Washington have developed a software engine that pulls together facts by combing through more than 500 million Web pages. The tool extracts information from billions of lines of text by analyzing the basic relationships between words.

Some experts say that this kind of “automated information extraction” will likely form the basis for far more intelligent next-generation Web search, in which nuggets of information are first gleaned and then combined intelligently.

The University of Washington project scales up an existing technology developed there, called TextRunner, in both the number of pages and the range of topics it can analyze.

“The significance of TextRunner is that it is scalable because it is unsupervised,” says Peter Norvig, director of research at Google, which donated the database of Web pages that TextRunner analyzes. “It can discover and learn millions of relations, not just one at a time. With TextRunner, there is no human in the loop: it just finds relations on its own.”

Norvig explains that previous technologies have required more guidance from the programmer. For example, to find the names of people who are CEOs within millions of documents, you’d first need to train the software with other examples, such as “Steve Jobs is CEO of Apple” or “Mark Zuckerberg is CEO of Facebook.” Norvig adds that Google is doing similar work and is already using such technology in limited contexts.

TextRunner gets rid of that manual labor. A user can enter, for example, “kills bacteria,” and the engine will return a list of pages that offer the insights that “chlorine kills bacteria,” “ultraviolet light kills bacteria,” or “heat kills bacteria” (results called “triples”), and provide ways to preview the text and then visit the Web page it comes from.
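To make the idea concrete, here is a minimal sketch in Python of querying a small, hand-built collection of such triples; the sample data and the search function are illustrative assumptions, not TextRunner’s actual interface.

    # Minimal sketch of looking up extracted (argument1, relation, argument2) triples.
    # The sample triples and the query function are illustrative assumptions,
    # not TextRunner's real index or API.
    triples = [
        ("chlorine", "kills", "bacteria"),
        ("ultraviolet light", "kills", "bacteria"),
        ("heat", "kills", "bacteria"),
        ("Edison", "invented", "the light bulb"),
    ]

    def search(relation, arg2):
        """Return every triple whose relation and second argument match the query."""
        return [t for t in triples if t[1] == relation and t[2] == arg2]

    for subject, verb, obj in search("kills", "bacteria"):
        print(subject, verb, obj)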

The prototype still has a fairly simple interface and is not meant for public search so much as to demonstrate the automated extraction of information from 500 million Web pages, says Oren Etzioni, a University of Washington computer scientist leading the project. “What we are showing is the ability of software to achieve rudimentary understanding of text at an unprecedented scale and scope,” he says.

Etzioni says TextRunner’s ability to extract meaning quickly and at huge scale flowed from his group’s discovery of a general model for how relationships are expressed in English that holds true no matter the topic. “For example, the simple pattern ‘entity1, verb, entity2’ covers the relationship ‘Edison invented the light bulb’ as well as ‘Microsoft acquired Farecast’ and many more,” he says. “TextRunner relies on this model, which is automatically learned from text, to analyze sentences and extract triples with high accuracy.”
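As a rough illustration of that kind of pattern matching, the sketch below uses a single hand-written regular expression in place of the model TextRunner learns automatically; real extraction relies on learned linguistic features rather than a fixed rule like this.

    import re

    # Illustrative sketch only: one hand-written 'entity1, verb, entity2' pattern
    # standing in for the model TextRunner learns automatically from text.
    PATTERN = re.compile(
        r"^(?P<arg1>[A-Z][\w ]*?) (?P<rel>invented|acquired|kills) (?P<arg2>.+?)\.?$"
    )

    sentences = [
        "Edison invented the light bulb.",
        "Microsoft acquired Farecast.",
        "Chlorine kills bacteria.",
    ]

    for sentence in sentences:
        match = PATTERN.match(sentence)
        if match:
            print((match.group("arg1"), match.group("rel"), match.group("arg2")))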

TextRunner also serves as a starting point for building inferences from natural-language queries, which is what the group is now working on. To give a simple example: if TextRunner finds a Web page that says “mammals are warm blooded” and another Web page that says “dogs are mammals,” an inference engine will produce the information that dogs are probably warm blooded.
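A toy sketch of that kind of chaining over extracted triples follows; the two facts and the single “are”-based rule are assumed here purely for illustration, not taken from the group’s system.

    # Toy sketch of forward chaining over triples: from "dogs are mammals" and
    # "mammals are warm blooded", conclude that dogs are probably warm blooded.
    # The rule format is an assumption for illustration only.
    facts = {
        ("dogs", "are", "mammals"),
        ("mammals", "are", "warm blooded"),
    }

    def infer(known):
        """Chain 'X are Y' and 'Y are Z' facts into likely 'X are Z' conclusions."""
        conclusions = set()
        for x, _, y in known:
            for y2, _, z in known:
                if y == y2 and (x, "are", z) not in known:
                    conclusions.add((x, "are", z))
        return conclusions

    for subject, _, obj in infer(facts):
        print(subject, "are probably", obj)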

This is analogous to technology developed by Powerset, which was acquired by Microsoft last year. Shortly before this acquisition, Powerset unveiled a tool that was limited to extracting facts from only about two million Wikipedia pages. The TextRunner technology handles Wikipedia pages plus arbitrary text on any page, including blog posts, product catalogues, newspaper articles, and more.

“This line of work has been making important advances in the scale at which these tasks can be approached,” says Jon Kleinberg, a computer scientist at Cornell University who has been following the University of Washington search research. He adds that “this work reflects a growing trend toward the design of search tools that actively combine the pieces of information they find on the Web into a larger synthesis.”
