MIT Technology Review

 


Search engines look for clues about the importance of a document or piece of information for a given set of keywords. Often this means relying on links from other pages, which is how Google’s famous PageRank algorithm works.
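As a rough illustration of link-based ranking, the sketch below runs a textbook power iteration over a tiny link graph. It is a simplified teaching version, not Google's production system; the function name `pagerank`, the damping value of 0.85, and the toy graph are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three pages link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most inbound links ends up with the highest score, which is the intuition the paragraph above describes.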

Researchers have now developed subtler ways of measuring the influence and importance of documents and pages on the Web and in archives, by using the text stored in those documents. This approach doesn’t rely on people adding pointers such as links and citations, and it could lead to better real-time search engines as well as recommendation systems that automatically gather information on a certain topic.

Software being developed at Princeton University takes an archive of documents and measures changes in language use between documents over time. The sample being analyzed could be a collection of scientific papers or a set of posts from certain blogs. The software analyzes the text in documents and then identifies the most significant words and phrases in particular categories: ones that appear often across many different documents. It then teases out the early appearances of those bits of language to pinpoint the documents that most likely contained ideas that influenced those in other documents. The algorithms can continue to run as items are added to a collection of documents over time.
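The general idea can be sketched in a few lines. This is a deliberately crude stand-in, not the Princeton researchers' model, which the article does not describe in detail. It assumes that a "significant" term is simply one that appears in at least `min_docs` documents, and it credits the earliest document containing that term. The names `influence_scores`, `STOPWORDS`, and the sample archive are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed, tiny stopword list so common words are not counted as significant.
STOPWORDS = {"a", "an", "the", "of", "at", "to", "we"}


def tokenize(text):
    """Return the set of distinct lowercase words in `text`, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def influence_scores(docs, min_docs=3):
    """Score documents by the widely adopted terms they used first.

    `docs` is a list of (doc_id, timestamp, text) tuples.
    """
    ordered = sorted(docs, key=lambda d: d[1])  # oldest first
    first_seen = {}               # term -> id of the earliest document using it
    doc_count = defaultdict(int)  # term -> number of documents using it

    for doc_id, _, text in ordered:
        for term in tokenize(text):
            doc_count[term] += 1
            first_seen.setdefault(term, doc_id)

    scores = {doc_id: 0 for doc_id, _, _ in ordered}
    for term, count in doc_count.items():
        if count >= min_docs:
            # Credit the earliest document once per later document using the term.
            scores[first_seen[term]] += count - 1
    return scores


# Hypothetical archive of four short "papers".
archive = [
    ("p1", 2001, "We propose a latent topic model"),
    ("p2", 2002, "A survey of clustering methods"),
    ("p3", 2003, "Latent topic inference at scale"),
    ("p4", 2004, "A latent topic approach to clustering text"),
]
scores = influence_scores(archive)
```

Here `p1` receives all the credit because it is the first document to use "latent" and "topic", which two later documents pick up. Because the scores depend only on counts and first appearances, they can be recomputed as new documents are appended to the archive.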

The researchers tested their algorithms on three large archives containing thousands of journal articles. The papers that the software identified as influential also tended to be highly cited, they found. But their method also provided new insights. In some cases, articles that weren’t cited much were identified as influential. The researchers discovered that these were often early discussions of an important subject. Sometimes articles that were highly cited were not identified as influential; in these cases, the researchers believed that the articles were important resources but did not present new ideas.

“This method captures a different kind of influence,” says David Blei, an assistant professor of computer science at Princeton who led the research. “It sees where a document introduces language and ideas that are picked up on by others.”

This research is part of a larger effort to build new tools for exploring large collections of documents, whether that means the archives of a scientific journal or a mass of blog posts and news articles. “Today, we can easily store all this information and access it, but we need guides to find the most useful content,” Blei says. The important thing, he adds, is to build tools that can make intelligent recommendations for how a user should explore a body of information. Methods that use the contents of the documents, instead of links or citations, are promising, he says.

This approach does require a historical perspective. For journal articles, the researchers looked at changes in language over a period of years. For blog posts, which change more rapidly, the method could work by looking at shifts in language over days or even hours. Blei says that such an approach could be added to search engines’ ranking algorithms to identify important documents, and it could help users navigate vast collections of information more easily.

Measuring information flow to determine influence has a lot of potential, says Jure Leskovec, an assistant professor of computer science in the machine learning department at Stanford University. The most obvious application, he says, is personalization; software could look at what sort of articles a person is reading and point her to articles or websites that contain relevant material.

Leskovec is also working on measuring influence. His research tracks the movement of phrases across the Internet and uses this information to identify sites that are influential in particular subject areas. This has enabled him and his collaborators to write algorithms that can predict how influential a new blog post is likely to be, based on its subject matter and where it appears. Adding a forward-looking perspective could be useful for real-time search, Leskovec says, by giving search engines a new way to rank and filter content more quickly.
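One very simple way to picture such a prediction is a historical average: estimate how far a new post's phrases will travel from how far earlier phrases from the same site and subject area traveled. The sketch below does only that. It is not Leskovec's algorithm, and the data layout, the function names `average_spread` and `predict_spread`, and the sample numbers are invented for illustration.

```python
from collections import defaultdict


def average_spread(history):
    """Average past spread per (site, topic).

    `history` is a list of (site, topic, spread) tuples, where `spread` is the
    number of other sites that later repeated a phrase first seen on `site`.
    """
    totals = defaultdict(lambda: [0, 0])  # (site, topic) -> [sum, count]
    for site, topic, spread in history:
        entry = totals[(site, topic)]
        entry[0] += spread
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def predict_spread(averages, site, topic, default=0.0):
    """Predict spread for a new post; fall back to `default` for unseen pairs."""
    return averages.get((site, topic), default)


# Hypothetical phrase-tracking history.
history = [
    ("blogA", "politics", 12),
    ("blogA", "politics", 8),
    ("blogB", "politics", 2),
    ("blogA", "sports", 1),
]
averages = average_spread(history)
prediction = predict_spread(averages, "blogA", "politics")
```

Because the estimate is available as soon as a post appears, before any links or citations accumulate, a score of this kind could in principle feed into real-time ranking.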


Credit: Technology Review

Tagged: Web, search, networks, machine learning, algorithms
