Websites where users can organize and share information are flourishing, but it can be hard to know which users and information to trust. Now a team of European researchers has developed an algorithm that ranks the expertise of users and can spot those who are using a site only to spam.
The technique works in a way similar to Amazon’s reputation engine or the ratings of Wikipedia pages, but it evaluates users based on a new set of criteria that makes intuitive assumptions about experts.
The algorithm draws on a method applied in ranking Web pages, but takes it an interesting step further, says Jon Kleinberg, a professor of computer science at Cornell University in Ithaca, NY, who was not involved with the work. “It distinguishes between ‘discoverers’ and ‘followers,’” Kleinberg says, “focusing on users who are the first to tag something that subsequently becomes popular.”
The new work focuses on collaborative tagging systems such as Delicious, a social bookmarking website, and Flickr, a photo-sharing site. These sites let users add relevant keywords to “tag” Web links or photos and then share them. Normally, users are ranked by how frequently or how recently they add content to the system. “It’s quantity over quality, so the more you do, the more credit you get,” says Michael Noll, a computer-science researcher at the Hasso Plattner Institute in Potsdam, Germany, and one of the developers of the new software. “But the fact is [that] quantity does not imply quality.”
The conventional approach also leaves the system very vulnerable to Web spammers, says Ciro Cattuto, a researcher at the Complex Network and Systems Group of the Institute for Scientific Interchange Foundation in Italy. Spammers adapt to the social behavior of other users, Cattuto says, so they see the most popular tags and start loading advertising content with those tags. To combat this, you need an algorithm that can search, rank, and present information in a usable way, says Cattuto. “The new method performs better than anything currently available: spammers rank very low, their content is not exposed, and eventually they stop polluting the system.”
The new algorithm is called Spamming-resistant Expertise Analysis and Ranking (SPEAR) and is based on HITS, a well-known link-analysis algorithm from information retrieval that, like the PageRank method behind Google, was developed to rank Web pages. Like HITS, SPEAR is a method of “mutual reinforcement,” says Kleinberg. In other words, the algorithm evaluates popular users and popular content and declares expert users to be the ones who identify the most important content, while important content is that which is identified by the most expert users. “The result is a way of identifying both expert users and high-quality content,” he says.
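The mutual-reinforcement loop Kleinberg describes can be sketched in a few lines. This is a minimal HITS-style illustration on a user–item tagging graph, not the published SPEAR implementation; the function name, the simple sum-and-normalize update, and the fixed iteration count are all assumptions for the sketch.

```python
from collections import defaultdict

def mutual_reinforcement(taggings, iterations=50):
    """HITS-style scoring on a bipartite user-item graph.

    taggings: iterable of (user, item) pairs meaning "user tagged item".
    Returns (expertise, quality) dictionaries of scores.
    """
    items_by_user = defaultdict(set)
    users_by_item = defaultdict(set)
    for user, item in taggings:
        items_by_user[user].add(item)
        users_by_item[item].add(user)

    expertise = {u: 1.0 for u in items_by_user}
    quality = {i: 1.0 for i in users_by_item}

    for _ in range(iterations):
        # An item's quality is the total expertise of the users who tagged it.
        quality = {i: sum(expertise[u] for u in users)
                   for i, users in users_by_item.items()}
        # A user's expertise is the total quality of the items they tagged.
        expertise = {u: sum(quality[i] for i in items)
                     for u, items in items_by_user.items()}
        # Normalize so the scores stay bounded across iterations.
        qn = sum(quality.values()) or 1.0
        en = sum(expertise.values()) or 1.0
        quality = {i: s / qn for i, s in quality.items()}
        expertise = {u: s / en for u, s in expertise.items()}
    return expertise, quality
```

On a toy data set, a user who tags several widely tagged items ends up with a higher expertise score than a user whose tags nobody else shares, which is the self-reinforcing effect the article describes.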
To rate a person’s level of expertise, as “good,” “average,” or “novice,” Noll’s team integrated a second factor into their algorithm: temporal information. “The idea is that the early bird gets the worm,” says Ching-man Au Yeung, a researcher in electronics and computer science at the University of Southampton in the U.K., who collaborated with Noll on the development of the algorithm. Those people who first discover content that subsequently receives a lot of tagging can be identified as trend setters in a community. “They are finding the usefulness of a document before others do,” says Au Yeung, who compares their acquisition of influence to the way a knowledgeable academic builds a reputation.
In contrast, followers find useful content later and tag it because it is already popular. These are more likely to be spammers, “people who identify a topic that grows in importance and use it to point to their own stuff,” says Scott Golder, formerly a research scientist at Hewlett Packard and currently a graduate student at Cornell. Golder adds that the SPEAR algorithm employs “a very smart set of criteria that has not been used before in computer science.”
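The discoverer-versus-follower idea can be folded into the same mutual-reinforcement loop by weighting each user–item link by how early the user tagged the item. The sketch below illustrates the principle only: the square-root credit curve, the function name, and the update rule are illustrative assumptions, not the details published by Noll and Au Yeung.

```python
import math
from collections import defaultdict

def temporal_expertise(tag_events, iterations=50):
    """Discoverer-vs-follower weighting in the spirit of SPEAR.

    tag_events: list of (user, item, timestamp) triples.
    Earlier taggers of an item receive more credit than later ones;
    the square-root decay used here is an illustrative choice.
    """
    taggers = defaultdict(list)
    for user, item, ts in tag_events:
        taggers[item].append((ts, user))

    # Credit matrix: earlier position in an item's tag history = more credit.
    credit = defaultdict(dict)
    for item, events in taggers.items():
        events.sort()
        n = len(events)
        for pos, (_, user) in enumerate(events):
            # The first tagger gets sqrt(n); the last gets sqrt(1).
            credit[user][item] = math.sqrt(n - pos)

    users = list(credit)
    items = list(taggers)
    expertise = {u: 1.0 for u in users}
    quality = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Same mutual reinforcement as before, but weighted by credit.
        quality = {i: sum(credit[u].get(i, 0.0) * expertise[u] for u in users)
                   for i in items}
        expertise = {u: sum(c * quality[i] for i, c in credit[u].items())
                     for u in users}
        qn = sum(quality.values()) or 1.0
        en = sum(expertise.values()) or 1.0
        quality = {i: s / qn for i, s in quality.items()}
        expertise = {u: s / en for u, s in expertise.items()}
    return expertise
```

Under this weighting, a spammer who piles onto an already-popular tag arrives late in every item’s history, earns little credit per link, and sinks in the ranking, which is the behavior Cattuto describes.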
Noll says that the algorithm can be adjusted for any online community, including Twitter and music-sharing sites. The work was presented last week at the SIGIR Conference in Boston. He adds that companies including Microsoft were interested in using the algorithm for social Web search, where documents are ranked based on users’ bookmarks.
“I’d expect … this combination of mutual reinforcement with the distinction between discoverers and followers to be useful in many domains,” says Kleinberg.