When Social Media Mining Gets It Wrong
Big problems could be ahead if we rely on conclusions drawn from individuals’ social-networking data.
A complex picture of your personal life can now be pieced together using a variety of public data sources, and increasingly sophisticated data-mining techniques. But just how accurate is that picture?
Last week in Las Vegas, at the computer security conference Black Hat, Alessandro Acquisti, an associate professor of information technology and public policy at the Heinz College at Carnegie Mellon University, showed how a photograph of a person can be used to find his or her date of birth, social security number, and other information by using facial recognition technology to match the image to a profile on Facebook and other websites. Acquisti acknowledges the privacy implications of this work, but he warns that the biggest problem could be the inaccuracy of this and other data-mining techniques.
Acquisti says that his current work is an attempt “to capture the future we are walking into.” In this future, he sees online information being used to prejudge a person on many levels—as a prospective date, borrower, employee, tenant, and so on. The Internet, he says, could become “a place where everyone knows your name”—a worldwide small town that won’t let you live anything down.
Beyond the obvious concerns about strangers knowing more than ever about you, Acquisti worries about what will happen when the technology makes mistakes. “We tend to make strong extrapolations about weak data,” says Acquisti. “It’s impossible to fight that, because it’s in our nature.”
A number of companies have already begun using social media to measure and track reputation. The Santa Barbara, California, company Social Intelligence, for example, performs social-media background screenings on prospective employees, promising to reveal negative information such as racist remarks or sexually explicit photos, or positive information such as signs of social media influence within a specific field. Other companies, such as Klout, track users’ level of social influence, allowing advertisers to offer special rewards to those with high scores.
But Acquisti’s research demonstrated the pitfalls of placing too much reliance on social networking data. His team took photos of volunteers and used an off-the-shelf face recognizer called PittPatt (recently acquired by Google) to find each volunteer’s Facebook profile—which often revealed that person’s real name and much more personal information. Using this information, the team could sometimes figure out part of a person’s social security number. They also created a prototype smart-phone app that pulls up personal information about a person whose photo is snapped with the device’s camera.
In their experiment, the team was able to match about one-third of subjects to the correct profiles. From there, they made other predictions. Seventy-five percent of the time, they correctly predicted subjects’ interests. They correctly predicted the first five digits of volunteers’ social security numbers about 16 percent of the time given two tries. (Accuracy increased with more attempts.)
But this means that two-thirds of the time, they did not identify people correctly. And those who were correctly identified were still incorrectly matched 25 percent of the time to particular personal interests, and more than 80 percent of the time to the wrong social security number.
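A rough back-of-the-envelope calculation makes the point concrete. The figures below come from the article, but the end-to-end multiplication assumes the two steps (matching the face, then predicting data from the matched profile) fail independently, which the study does not confirm:

```python
# Rates reported in the article for Acquisti's experiment.
match_rate = 1 / 3        # photo matched to the correct Facebook profile
ssn_rate = 0.16           # correct first five SSN digits (two tries), given a match
interest_rate = 0.75      # correct interest prediction, given a match

# End-to-end success for a random stranger, ASSUMING the steps are
# independent (an illustrative simplification, not a claim from the study).
end_to_end_ssn = match_rate * ssn_rate
end_to_end_interest = match_rate * interest_rate

print(f"Correct profile and SSN prefix: {end_to_end_ssn:.1%}")    # about 5.3%
print(f"Correct profile and interests:  {end_to_end_interest:.1%}")  # 25.0%
```

Even generously, then, the full pipeline mislabels a stranger far more often than it labels them correctly—which is exactly the failure mode Acquisti warns about.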
Acquisti expects facial recognition technology to continue improving in coming years, and he asks what will happen once it is considered good enough to be trusted most of the time. It could be nightmarish for those who are misidentified. “There’s nothing that we, as individuals, can control,” he says.
Other researchers are exploring the reliability of mining social data. At Defcon, a hacking conference in Las Vegas last weekend, a group called the Online Privacy Foundation presented results of its “Big Five Experiment,” a study that aimed to match volunteers’ personality traits to qualities on Facebook profiles. After administering a personality test to volunteers, they mined profiles to identify key characteristics.
The Online Privacy Foundation researchers found a positive correlation between people whose personalities tended toward openness and those whose Facebook profiles were loaded with more information: longer lists of interests, longer bios, and more discussion of money, religion, death, and negative emotions. They also found a positive correlation between “agreeable people”—defined as “being compassionate, cooperative, having the ability to forgive and be pragmatic”—and Facebook statuses that were written in longer sentences, that discussed positive emotions, or that had relatively more comments, friends, and photos. In both cases, however, the correlations were relatively weak.
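What a correlation of this kind actually measures can be sketched with the standard Pearson coefficient. The data below are invented for illustration—the foundation’s real scores and statistics are not reproduced here—and the variable names (`openness`, `bio_length`) are hypothetical stand-ins for one personality trait and one profile feature:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up openness scores and profile bio lengths for ten volunteers.
openness   = [2, 5, 3, 7, 4, 6, 1, 8, 5, 4]
bio_length = [40, 90, 30, 60, 80, 110, 50, 95, 45, 70]

r = pearson(openness, bio_length)
print(f"r = {r:.2f}")  # positive, but well short of 1.0
```

A coefficient near 1.0 would mean the trait can be read straight off the profile; a modest positive value, like the ones the foundation reported, means the link exists on average but says little about any one individual—hence Sumner’s warning below that it is “a bet.”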
The researchers conclude that a Facebook profile is hardly a reliable source of information. “The key point is to remember that this is a bet,” says the foundation’s cofounder Chris Sumner. “The message is that, yes, there is a link, but don’t use it on its own for critical decisions.”
Acquisti and Sumner say that new government policies may be needed to protect individuals from excessive data mining and from the misuse of their information. This could involve setting standards of accuracy for organizations to abide by. “The defining question of our time,” Acquisti says, “is how do we, as a society, deal with big data?”