Search engines look for clues about the importance of a document or piece of information for a given set of keywords. Often this means relying on what other pages link to; this is how Google’s famous PageRank algorithm works.
Researchers have now developed subtler ways of measuring the influence and importance of documents and pages on the Web and in archives, by using the text stored in those documents. This approach doesn’t rely on people adding pointers such as links and citations, and it could lead to better real-time search engines as well as recommendation systems that automatically gather information on a certain topic.
Software being developed at Princeton University takes an archive of documents and measures changes in language use between documents over time. The sample being analyzed could be a collection of scientific papers or a set of posts from certain blogs. The software analyzes the text in documents and then identifies the most significant words and phrases in particular categories, meaning those that appear often across many different documents. It then traces the earliest appearances of that language to pinpoint the documents most likely to have contained ideas that influenced later ones. The algorithms can continue to run as items are added to a collection of documents over time.
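The procedure described above can be illustrated with a deliberately simplified sketch. It is not the Princeton group's model, whose details this article does not give; it is a toy heuristic that assumes each document is a `(doc_id, time, text)` tuple, treats a term as "significant" when it appears in at least `min_docs` documents, and credits the earliest document using a term with one point per later document that also uses it. The function name, the parameters, and the stop-word handling are all hypothetical.

```python
from collections import defaultdict


def influence_scores(docs, min_docs=2, stopwords=frozenset()):
    """Toy influence heuristic (an illustration, not the researchers' model).

    docs: list of (doc_id, time, text) tuples.
    Returns a dict mapping doc_id to an integer score.
    """
    # Record every (time, doc_id) at which each term appears.
    appearances = defaultdict(list)
    for doc_id, time, text in docs:
        for term in set(text.lower().split()):
            if term not in stopwords:
                appearances[term].append((time, doc_id))

    scores = {doc_id: 0 for doc_id, _, _ in docs}
    for uses in appearances.values():
        if len(uses) < min_docs:
            continue  # term is not widespread enough to count as significant
        uses.sort()  # earliest use first; ties are broken by doc_id
        first_time, first_doc = uses[0]
        # One point for each strictly later document that uses the term.
        scores[first_doc] += sum(1 for t, _ in uses if t > first_time)
    return scores


archive = [
    ("a", 2001, "topic model inference"),
    ("b", 2003, "topic model for text"),
    ("c", 2005, "model of text"),
]
scores = influence_scores(archive)
```

In this small example, document `a` scores highest because it is the first to use "topic" and "model", both of which later documents pick up. A realistic system would need phrase detection, proper tokenization, and some way to separate genuine uptake from coincidence.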
The researchers tested their algorithms on three large archives containing thousands of journal articles. They found that the papers the software identified as influential were also ones that had been highly cited. But their method also provided new insights. In some cases, articles that weren’t cited much were identified as influential. The researchers discovered that these were often early discussions of an important subject. Sometimes articles that were highly cited were not identified as influential; in these cases, the researchers believed that the articles were important resources but did not present new ideas.
“This method captures a different kind of influence,” says David Blei, an assistant professor of computer science at Princeton who led the research. “It sees where a document introduces language and ideas that are picked up on by others.”
This research is part of a larger effort to build new tools for exploring large collections of documents, whether that means the archives of a scientific journal or a mass of blog posts and news articles. “Today, we can easily store all this information and access it, but we need guides to find the most useful content,” Blei says. The important thing, he adds, is to build tools that can make intelligent recommendations for how a user should explore a body of information. Methods that use the contents of the documents, instead of links or citations, are promising, he says.
This approach does require a historical perspective. For journal articles, the researchers looked at changes in language over a period of years. For blog posts, which change more rapidly, the method could work by looking at shifts in language over days or even hours. Blei says that such an approach could be added to search engines’ ranking algorithms to identify important documents, and it could help users navigate vast collections of information more easily.
Measuring information flow to determine influence has a lot of potential, says Jure Leskovec, an assistant professor of computer science in the machine learning department at Stanford University. The most obvious application, he says, is personalization; software could look at what sort of articles a person is reading and point her to articles or websites that contain relevant material.
Leskovec is also working on measuring influence. His research tracks the movement of phrases across the Internet and uses this information to identify sites that are influential in particular subject areas. This has enabled him and his collaborators to write algorithms that can predict how influential a new blog post is likely to be, based on its subject matter and where it appears. Adding a forward-looking perspective could be useful for real-time search, Leskovec says, by giving search engines a new way to rank and filter content more quickly.
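As a rough illustration of predicting influence from subject matter and source, the sketch below averages how far past posts spread. It is a naive baseline under stated assumptions, not Leskovec's algorithm: the `history` records, the `spread` measure (for example, the number of sites that later repeated a post's phrases), and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_predictor(history):
    """history: non-empty list of (site, topic, spread) records for past posts.

    Returns a function predict(site, topic) that estimates the spread of a
    new post from historical averages.
    """
    by_pair = defaultdict(list)
    by_topic = defaultdict(list)
    for site, topic, spread in history:
        by_pair[(site, topic)].append(spread)
        by_topic[topic].append(spread)
    overall = mean(spread for _, _, spread in history)

    def predict(site, topic):
        # Prefer the most specific history available, then fall back.
        if (site, topic) in by_pair:
            return mean(by_pair[(site, topic)])
        if topic in by_topic:
            return mean(by_topic[topic])
        return overall

    return predict


history = [
    ("x.example", "tech", 10),
    ("x.example", "tech", 20),
    ("y.example", "tech", 0),
    ("y.example", "politics", 4),
]
predict = build_predictor(history)
estimate = predict("x.example", "tech")
```

The fallback order (site and topic, then topic alone, then the overall average) is a design choice made for this sketch so that posts from unseen sites still receive an estimate.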