MIT researchers have found that the dates and locations of four purchases are enough to identify 90 percent of the people in a data set recording three months’ worth of credit card transactions by 1.1 million users.
When the researchers also considered coarse-grained information about the prices of purchases, just three data points were enough to identify an even larger percentage of people in the data set. That means that someone with copies of just three of your recent receipts—or one receipt, one Instagram photo of you having coffee with friends, and one tweet about the phone you just bought—would have a 94 percent chance of extracting your credit card records from those of a million other people. This is true, the researchers say, even in cases where no one in the data set is identified by name, address, credit card number, or anything else that we typically think of as personal information.
The data set the researchers analyzed included the names and locations of the shops at which purchases took place, the dates on which they took place, and the purchase amounts. Purchases made with the same credit card were all tagged with the same random identification number.
For each identification number—each customer in the data set—the researchers selected purchases at random, then determined how many other customers’ purchase histories contained the same data points. They varied the number of data points per customer over a range from two to five. Without price information, two data points were still sufficient to identify more than 40 percent of the people in the data set. At the other extreme, five points with price information were enough to identify almost everyone.
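The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function name `unicity`, the `TOY_TRACES` data, and the representation of each purchase as a `(shop, date)` pair are all assumptions made for the example. For each customer it draws `k` purchases at random and counts how many purchase histories in the data set contain all of them; a customer counts as identified when the only match is their own history.

```python
import random

# Hypothetical toy data: random ID -> set of (shop, date) purchases.
# These values are invented for illustration only.
TOY_TRACES = {
    "id_01": {("cafe", "2014-01-03"), ("bakery", "2014-01-04"), ("gym", "2014-01-05")},
    "id_02": {("cafe", "2014-01-03"), ("bakery", "2014-01-04"), ("bookshop", "2014-01-05")},
    "id_03": {("cafe", "2014-01-03"), ("shoe store", "2014-01-04"), ("gym", "2014-01-05")},
}


def unicity(traces, k, rng):
    """Return the fraction of customers singled out by k of their own purchases.

    For each customer with at least k purchases, pick k at random and count
    the histories that contain all k points. Exactly one match means the
    points identify that customer uniquely.
    """
    eligible = 0
    unique = 0
    for purchases in traces.values():
        if len(purchases) < k:
            continue  # not enough purchases to draw k points
        eligible += 1
        # sorted() gives rng.sample a sequence and a stable order
        sample = set(rng.sample(sorted(purchases), k))
        matches = sum(1 for other in traces.values() if sample <= other)
        if matches == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (1, 2, 3):
        print(k, unicity(TOY_TRACES, k, rng))
```

Adding a coarse price bucket to each tuple, for example `(shop, date, price_range)`, would be one way to model the price information mentioned above; the rest of the sketch would stay the same.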
Preserving anonymity in large data sets is a pressing concern because public and private entities alike see aggregated digital data as a source of novel insights. Retailers studying anonymized credit card histories could certainly learn something about the tastes of their customers, but economists might also learn something about the relationship of, say, inflation or consumer spending to other economic factors.
Lead researcher Yves-Alexandre de Montjoye is a grad student in Sandy Pentland’s Human Dynamics Laboratory at the Media Lab. “Sandy and I do really believe that this data has great potential and should be used,” he says. “We, however, need to be aware and account for the risks of reidentification.”