The data trail we leave behind us grows all the time. Most of it isn’t that interesting—the takeout meal you ordered, that shower head you bought online—but some of it is deeply personal: your medical diagnoses, your sexual orientation, or your tax records.
The most common way public agencies protect our identities is anonymization. This involves stripping out obviously identifiable things such as names, phone numbers, email addresses, and so on. Data sets are also altered to be less precise, columns in spreadsheets are removed, and “noise” is introduced to the data. Privacy policies reassure us that this means there’s no risk we could be tracked down in the database.
However, a new study in Nature Communications suggests this is far from the case.
Researchers from Imperial College London and the University of Louvain have created a machine-learning model that estimates exactly how easy individuals are to reidentify from an anonymized data set. They have also published an online tool where you can check your own score by entering your zip code, gender, and date of birth.
On average, in the US, using those three attributes, you could be correctly located in an “anonymized” database 81% of the time. Given 15 demographic attributes of someone living in Massachusetts, there’s a 99.98% chance that person could be found in any anonymized database.
“As the information piles up, the chances it isn’t you decrease very quickly,” says Yves-Alexandre de Montjoye, a researcher at Imperial College London and one of the study’s authors.
The tool was created by assembling a database of 210 different data sets from five sources, including the US Census. The researchers fed this data into a machine-learning model, which learned which combinations of attributes are more nearly unique and which are less so, and which then assigns a probability of correct identification.
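To make the idea of "nearly unique" combinations concrete, here is a minimal sketch in Python. It is not the researchers' model, which estimates uniqueness in a whole population from incomplete data. It only counts how many records in a small, invented table are the sole holder of a given combination of attributes. The records, field names, and the `unique_fraction` helper are all hypothetical.

```python
from collections import Counter

def unique_fraction(records, attributes):
    """Share of records whose combination of the given attributes
    appears exactly once in the data set (hypothetical helper)."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

# Invented toy data: no names, yet some rows are still one of a kind.
records = [
    {"zip": "02139", "gender": "F", "dob": "1985-03-14"},
    {"zip": "02139", "gender": "F", "dob": "1990-07-01"},
    {"zip": "02139", "gender": "M", "dob": "1985-03-14"},
    {"zip": "02140", "gender": "F", "dob": "1985-03-14"},
    {"zip": "02139", "gender": "F", "dob": "1985-03-14"},
]

for attrs in (["zip"], ["zip", "gender"], ["zip", "gender", "dob"]):
    print(attrs, unique_fraction(records, attrs))
```

In this toy table the share of one-of-a-kind records rises from 20% with zip code alone to 40% with zip code and gender, and to 60% once date of birth is added. That is the pattern de Montjoye describes: each extra attribute leaves fewer people who could plausibly match.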
This isn’t the first study to show how easy it is to track down individuals from anonymized databases. A paper back in 2007 showed that just a few movie ratings on Netflix can identify a person as easily as a Social Security number, for example. However, the new study shows just how far current anonymization practices have fallen behind our ability to break them. The fact that a data set is incomplete does not protect people’s privacy, says de Montjoye.
It isn’t all bad news. These same reidentification techniques were used by journalists working at the New York Times earlier this year to expose Donald Trump’s tax returns from 1985 to 1994. However, the same method could be used by someone looking to commit ID fraud or obtain information for blackmail purposes.
“The issue is that we think when data has been anonymized it’s safe. Organizations and companies tell us it’s safe, and this proves it is not,” says de Montjoye.
For peace of mind, companies should be using differential privacy, a complex mathematical model that lets organizations share aggregate data about user habits while protecting an individual’s identity, argues Charlie Cabot, research lead at the privacy engineering firm Privitar.
The technique will get its first major test next year: it’s being used to secure the US Census database.
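One standard building block of differential privacy is the Laplace mechanism, which adds calibrated random noise to an aggregate answer before it is released. The sketch below, in Python, applies it to a simple count. It is an illustration under stated assumptions, not the system Privitar or the US Census uses: the `private_count` and `laplace_noise` helpers, the sample ages, and the choice of `epsilon` are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values matching `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon is used.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Invented example: how many people in a small group are over 40?
ages = [23, 35, 41, 52, 67, 29]
print(private_count(ages, lambda age: age > 40, epsilon=1.0))
```

Each released answer is off by a small random amount, so no single person's presence can be confirmed from it, while the average over many such answers stays close to the true count. The cost is accuracy: the smaller the `epsilon`, the noisier each answer.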