The data trail we leave behind us grows all the time. Most of it isn’t that interesting—the takeout meal you ordered, that shower head you bought online—but some of it is deeply personal: your medical diagnoses, your sexual orientation, or your tax records.
The most common way public agencies protect our identities is anonymization. This involves stripping out obviously identifiable things such as names, phone numbers, email addresses, and so on. Data sets are also altered to be less precise, columns in spreadsheets are removed, and “noise” is introduced to the data. Privacy policies reassure us that this means there’s no risk we could be tracked down in the database.
However, a new study in Nature Communications suggests this is far from the case.
Researchers from Imperial College London and the Université catholique de Louvain have created a machine-learning model that estimates how easy it is to reidentify individuals from an anonymized data set. You can check your own score here, by entering your zip code, gender, and date of birth.
On average, in the US, using those three attributes, you could be correctly located in an “anonymized” database 81% of the time. Given 15 demographic attributes of someone living in Massachusetts, there’s a 99.98% chance you could find that person in any anonymized database.
“As the information piles up, the chances it isn’t you decrease very quickly,” says Yves-Alexandre de Montjoye, a researcher at Imperial College London and one of the study’s authors.
The tool was created by assembling a database of 210 data sets from five sources, including the US Census. The researchers fed this data into a machine-learning model, which learned which combinations of attributes are close to unique and which are not, and uses that to assign the probability of a correct identification.
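The intuition behind the model—each extra attribute shrinks the pool of people who match you—can be shown with a toy simulation. The attribute names and value ranges below are made up for illustration; this is not the researchers’ model:

```python
import random

random.seed(0)

# Hypothetical synthetic population of 100,000 people, each with a few
# coarse demographic attributes (illustrative, not from the study).
N = 100_000
attributes = {
    "zip3":        lambda: random.randrange(900),        # 3-digit ZIP prefix
    "gender":      lambda: random.randrange(2),
    "birth_year":  lambda: random.randrange(1920, 2020),
    "birth_month": lambda: random.randrange(12),
    "birth_day":   lambda: random.randrange(28),
}
people = [tuple(gen() for gen in attributes.values()) for _ in range(N)]

def unique_fraction(k):
    """Fraction of people whose first k attributes are unique in the population."""
    counts = {}
    for person in people:
        key = person[:k]
        counts[key] = counts.get(key, 0) + 1
    return sum(1 for p in people if counts[p[:k]] == 1) / N

for k in range(1, len(attributes) + 1):
    print(f"{k} attributes: {unique_fraction(k):.1%} unique")
```

With one coarse attribute almost nobody is unique; with all five, nearly everyone is—which is the effect de Montjoye describes.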
This isn’t the first study to show how easy it is to track down individuals in anonymized databases. A 2007 paper showed that just a few movie ratings on Netflix can identify a person as reliably as a Social Security number, for example. But the new study shows just how far current anonymization practices have fallen behind our ability to break them. Even the fact that a data set is incomplete does not protect people’s privacy, says de Montjoye.
It isn’t all bad news. These same reidentification techniques were used by journalists at the New York Times earlier this year to expose Donald Trump’s tax returns from 1985 to 1994. However, the same method could also be used by someone looking to commit identity fraud or gather material for blackmail.
“The issue is that we think when data has been anonymized it’s safe. Organizations and companies tell us it’s safe, and this proves it is not,” says de Montjoye.
For peace of mind, companies should be using differential privacy, a mathematical framework that lets organizations share aggregate data about user habits while protecting individual identities, argues Charlie Cabot, research lead at the privacy engineering firm Privitar.
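The basic building block of differential privacy is the Laplace mechanism: add noise, calibrated to how much one person can change a statistic, before releasing it. Here is a minimal sketch; the query and the epsilon value are illustrative choices, not drawn from any real deployment:

```python
import random

def laplace_noise(scale):
    # The difference of two independent exponential draws is
    # Laplace-distributed with the given scale.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon=0.5):
    # A counting query has sensitivity 1: adding or removing one person
    # changes the true count by at most 1, so the noise scale is 1/epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: release a noisy count of people under 40 in a toy data set.
ages = [23, 35, 47, 52, 38, 29, 61, 44]
print(private_count(ages, lambda age: age < 40))
```

The released count is close to the truth on average, but any single individual’s presence or absence is masked by the noise—which is what lets aggregate statistics be shared safely.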
The technique will get its first major test next year: it’s being used to secure the US Census database.