Unmasking Social-Network Users

One way for social networks to make money is by sharing information about users with advertisers and others who are interested in understanding consumer behavior and exploiting online trends.

Social networks typically promise to remove “personally identifying information” before sharing this data, to protect users’ privacy. But researchers from the University of Texas at Austin have found that, combined with readily available data from other online sources, this anonymized data can still reveal sensitive information about users.

In tests involving the photo-sharing site Flickr and the microblogging service Twitter, the Texas researchers were able to identify a third of the users with accounts on both sites simply by searching for recognizable patterns in anonymized network data. Both Twitter and Flickr display user information publicly, so the researchers anonymized much of the data in order to test their algorithms.

The researchers wanted to see if they could extract sensitive information about individuals using just the connections between users, even if almost all of the names, addresses, and other forms of personally identifying information had been removed. They found that they could, provided they could compare these patterns with those from another social-network graph where some user information was accessible.

Data from social networks, particularly the pattern of friendship between users, can be valuable to advertisers, says Vitaly Shmatikov, a professor of computer science at the University of Texas at Austin, who was involved in the research. Most social networks plan to make money by sharing this information, while advertisers hope to use it, for example, to find a particularly influential user and target her with advertising to reach her network of friends. But Shmatikov says that this information also makes networks vulnerable. “When you release this data, you have to preserve the structure of the social network,” he says. “If you don’t, then probably it’s useless for the purpose for which you are releasing it.”

The researchers say that it is fairly easy to find nonanonymous social-network data: the connections between friends in many networks, such as Twitter, are made public by default. Meanwhile, efforts to create a universal “social graph,” such as with OpenSocial, provide even more resources. The researchers’ algorithms worked with only a 12 percent error rate even when the patterns of social connections were significantly different: only 14 percent of users’ relationships overlapped from Twitter to Flickr. The results are described in a paper to be presented later this month at the IEEE Symposium on Security and Privacy.

“The structure of the network around you is so rich, and there are so many different possibilities, that even though you have millions of people participating in the network, we all end up with different networks around us,” says Shmatikov. “Once you deal with sufficiently sophisticated human behavior, whether you’re talking about purchases people make or movies they view or, in this case, friends they make and how they behave socially, people tend to be fairly unique. Every person does a few quirky, individual things which end up being strongly identifying.”

To give the algorithm a starting point, the researchers also need to identify a few users from an anonymous social-network graph. But they say that this is easy to do on many social networks. A portion of users of Facebook, for example, choose to make their profiles public, and an attacker could use this as the starting point. In their experiments, the researchers found that they needed to identify as few as 30 individuals in order to be able to run their algorithms on networks of 100,000 users or more.
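
To make the idea of a "starting point" concrete, here is a minimal Python sketch of seed-based matching between two graphs. It is not the researchers' algorithm, which the article does not describe in detail; it is a simplified illustration that assumes both graphs are given as adjacency sets and that a few seed pairs are already known. The function name `propagate_mapping`, the `min_margin` parameter, and the toy data are all hypothetical.

```python
from collections import defaultdict


def propagate_mapping(anon, known, seeds, min_margin=1):
    """Extend a seed mapping (anonymous node -> known node) using only
    friendship structure. Illustrative sketch, not the published attack.

    anon, known: dict mapping each node to the set of its neighbours.
    seeds: dict of already-identified pairs {anon_node: known_node}.
    min_margin: hypothetical threshold; the best candidate must beat the
        runner-up by at least this many shared mapped neighbours.
    """
    mapping = dict(seeds)
    used = set(mapping.values())
    changed = True
    while changed:
        changed = False
        for node in anon:
            if node in mapping:
                continue
            # Count, for each unclaimed candidate in the known graph, how
            # many of this node's already-mapped neighbours point to it.
            scores = defaultdict(int)
            for neighbour in anon[node]:
                if neighbour in mapping:
                    for candidate in known[mapping[neighbour]]:
                        if candidate not in used:
                            scores[candidate] += 1
            if not scores:
                continue
            ranked = sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))
            best, best_score = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else 0
            # Accept only a clear winner; ambiguous nodes stay unmapped.
            if best_score - runner_up >= min_margin:
                mapping[node] = best
                used.add(best)
                changed = True
    return mapping


# Toy example: the "anonymized" graph has the same shape as the public one.
known_graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave": {"bob", "erin"},
    "erin": {"dave"},
}
anon_graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2},
    4: {2, 5},
    5: {4},
}
result = propagate_mapping(anon_graph, known_graph, seeds={1: "alice", 2: "bob"})
print(result)
```

In this toy case, two seeds are enough to label the remaining three anonymous nodes, because each new match gives the next node a mapped neighbor to anchor on. Real networks are noisier and only partly overlapping, so a practical attack would need a more careful scoring rule than this one.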

The researchers add that the algorithm uses the smallest amount of information feasible and that, in practice, a determined snoop would be able to find much more. “This attack would have been much, much stronger if we’d actually used information that is typically left after [names and addresses] have been removed,” says Shmatikov. “So we’re really showing how the bare minimum is enough.”

“It’s important research,” says Alessandro Acquisti, an associate professor of information technology and public policy at Carnegie Mellon University and an expert on privacy online. The research highlights how data that might not seem important can actually provide an attacker with the means to uncover truly sensitive information, Acquisti says. For example, the algorithm could theoretically employ the names of a user’s favorite bands and concert-going friends to decode sensitive details such as sexual orientation from supposedly anonymized data. Acquisti believes that the result paints a bleak picture for the future of online privacy. “There is no such thing as complete anonymity,” he says. “It’s impossible.”

Shmatikov does not think there is a technical solution to the problem. He suggests that privacy laws and corporate practices may need to be changed to recognize that there’s no way to anonymize social-network data. Users should also be able to decide whether to allow their data to be shared in the first place, Shmatikov says.
