Researchers at AT&T, Rutgers University, Princeton, and Loyola University have devised a way to mine cell-phone data without revealing your identity, potentially showing a route to avoiding privacy pitfalls that have so far confined global cell-phone data-mining work to research labs.
Working with billions of location data points from AT&T mobile phone calls and text messages around Los Angeles and New York City, they’ve built a “mobility model” of the two regions that aggregates the data, produces representative “synthetic call records,” and then mathematically obscures any data that could tend to identify people.
The model can do things like rapidly predict how a new development or telecommuting policy would affect overall transportation, or it could be a new tool for planning at a town level where little mobility data is available, says Margaret Martonosi, a computer scientist at Princeton who is working on the model. Right now, planners generally rely on road sensors and the limited number of people who permit their GPS position to be captured.
Vincent Blondel, a computer scientist at the Université Catholique de Louvain, Belgium, and a leader in research efforts on call data records and privacy issues, says the work was impressive. “This is an excellent work that will help explore ways of making the best use of important data in a privacy protective way,” he says.
Even the simplest phone leaves behind extensive digital traces—called call detail records, or CDRs—that are preserved by mobile carriers. These records—on the time a voice call or text message was placed, and the identity and location of the cell tower involved—give the approximate locations of the phone’s owner. Over time, they can be used to develop an accurate trace of the user’s movements.
In aggregate, though mostly in theory so far, this data can be used to guide epidemiology research, or to unsnarl traffic by giving an unprecedented view of human movement patterns (see “How Wireless Carriers Are Monetizing Your Movements”). It can also guide development efforts in poorer parts of the world (see “Big Data from Cheap Phones”).
But building in guaranteed privacy protections is the toughest hurdle facing the growing number of research efforts that tap CDRs. Even if such records are stripped of names and numbers, the identity of the person can often be revealed through other means. For example, a single cell-tower ping at 4:12 a.m. could be connected to a public tweet made at 4:12 a.m. that includes the location and identity of the tweeter. Similar risks crop up for data belonging to people who live in a remote area or have unusual home-work commuting patterns.
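The tweet example can be sketched as a simple join on tower and time. This is a toy illustration only: the records, pseudonyms, tower names, and the `link` helper below are all hypothetical, and real linkage attacks would have to deal with fuzzier matches in both time and space.

```python
from collections import defaultdict

# Hypothetical "anonymized" call records: (pseudonym, tower_id, "HH:MM").
cdrs = [
    ("user_17", "tower_A", "04:12"),
    ("user_17", "tower_B", "09:30"),
    ("user_42", "tower_B", "09:30"),
    ("user_42", "tower_C", "18:05"),
]

# Hypothetical public posts: (name, tower covering the tagged location, "HH:MM").
posts = [
    ("alice", "tower_A", "04:12"),
    ("bob", "tower_B", "09:30"),
]


def link(cdrs, posts):
    """Return {name: pseudonym} for posts that match exactly one pseudonym."""
    by_key = defaultdict(set)
    for pseudonym, tower, minute in cdrs:
        by_key[(tower, minute)].add(pseudonym)

    matches = {}
    for name, tower, minute in posts:
        candidates = by_key.get((tower, minute), set())
        # A unique candidate means the pseudonym is effectively unmasked.
        if len(candidates) == 1:
            matches[name] = next(iter(candidates))
    return matches


reidentified = link(cdrs, posts)
print(reidentified)  # {'alice': 'user_17'}
```

In this toy data, the 4:12 a.m. ping is unique and gives one user away, while the 9:30 a.m. ping is shared by two phones and stays ambiguous. That is why sparse or unusual records carry the most risk.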
The new approach starts by aggregating traces of real human movements, then identifying common locations that might indicate home, work, or school. Next, it creates a set of transportation models. These models generate route tracks of people that the researchers call “synthetic,” because they are merely representative of the aggregate data, and not of actual people.
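One way to picture the "common locations" step is a frequency heuristic: treat the tower a phone uses most at night as home and the one it uses most during working hours as work. The article does not say how the researchers identify these places, so the hour windows and the `likely_places` function below are assumptions made purely for illustration.

```python
from collections import Counter


def likely_places(events):
    """Guess (home, work) towers from a list of (hour_of_day, tower_id) events.

    Assumed heuristic, not the researchers' method:
    home = most-used tower between 22:00 and 06:00,
    work = most-used tower between 09:00 and 17:00.
    """
    night = Counter(tower for hour, tower in events if hour >= 22 or hour < 6)
    day = Counter(tower for hour, tower in events if 9 <= hour < 17)
    home = night.most_common(1)[0][0] if night else None
    work = day.most_common(1)[0][0] if day else None
    return home, work


# Hypothetical events for one phone.
events = [(23, "A"), (2, "A"), (1, "B"), (10, "C"), (14, "C"), (15, "A")]
home, work = likely_places(events)
print(home, work)  # A C
```

A synthetic traveler could then be generated by sampling a home and work pair from the aggregate distribution of such guesses, rather than by copying any one person's pair.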
But the third part is the key. Even these supposedly synthetic records can closely match real ones (especially when the underlying aggregate sample is small). So an algorithm, using an emerging technique known as differential privacy, calculates exactly how high this risk is, and how to reduce it by altering the data. “Noise is injected into the model at points in order to reduce the likelihood of individuals being identifiable,” says Martonosi.
Injecting noise includes deliberately altering the aggregated home and work locations to reduce the reliance on any one individual’s data. Likewise, the aggregated call times are changed to mask any individual’s contribution. Taken together, such tweaks to the data would throw off any efforts to align databases.
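A minimal sketch of this kind of noise injection is the Laplace mechanism, a standard building block of differential privacy. The article does not specify which mechanism the researchers use, so this is a swapped-in textbook technique. The counts, the `epsilon` value, and the assumption that each person contributes to exactly one home-tower count (sensitivity of 1) are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise with scale = sensitivity / epsilon to each count.

    Assumes each person affects a single count by at most `sensitivity`.
    A smaller epsilon means more noise and stronger privacy.
    """
    scale = sensitivity / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


# Hypothetical aggregate: number of phones whose inferred home is each tower.
home_counts = {"tower_A": 120, "tower_B": 3, "tower_C": 57}

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
released = noisy_counts(home_counts, epsilon=0.5, rng=rng)
print(released)
```

The noise has the same scale for every tower, so it barely changes a count of 120 but can swamp a count of 3. The small counts are exactly the ones where a single individual's contribution would otherwise stand out.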
Part of this new mobility modeling work was first presented at a conference last year, but refinements and the differential privacy variant were presented last week at a conference at MIT. At the same conference, IBM researchers showed how call records could help optimize public transportation routes (see “African Bus Routes Redrawn Using Cell Phone Data”).
Martonosi says that she and her colleagues will not publicly release their mobility models of the New York and Los Angeles metro areas until additional publications finalize the work and validate the privacy approach, since the models indirectly draw on real user data.
In the meantime, the methods she and her colleagues used to build the model have been published. So other groups could build similar models for other metro areas if they have their own call data records to work with, she says. AT&T collaborated on the research, which was done at an AT&T facility using three months’ worth of data from 300,000 of the carrier’s customers in each of the New York and Los Angeles areas. AT&T declined to comment for this story.
Amid surging research interest in mobile data, the researchers’ approach is drawing considerable attention. William Hoffman, who heads the World Economic Forum’s data-based development efforts, says the approach showed promise. “I thought the concept was quite interesting as a means of ‘de-risking’ the ability of researchers to explore the data,” he says. “It’s one of multiple steps data holders can pursue to strike the balance of using data while protecting the individual.”
One key question is whether a system of synthetic data records could get the carriers around the delicate matter of obtaining user consent. “That’s one of the big issues I took away from the [recent MIT] conference,” says Hoffman. The answer might depend on how the data was used or sold, he says.
Nicolas Decordes, a vice president at Orange, the European carrier, says the company’s R&D team concluded that the techniques “would be feasible and could be helpful” for transportation modeling. Because the method does not use real-time data, however, it is better suited to planning and cannot guide responses to events as they unfold.
The process of obtaining and using cell-phone data is already very touchy. When Orange released data from Ivory Coast to researchers last year, a process Decordes oversaw, that nation was chosen because its Information and Communications Technology (ICT) ministry hadn’t signed on to a regulatory framework restricting such use, in contrast to nearby African nations. And even so, Orange required researchers to sign agreements barring them from trying to identify individuals.
Linus Bengtsson, an epidemiologist at Sweden’s Karolinska Institute and a founder of Flowminder, which provides mobility data to NGOs and relief agencies, says that however advanced the privacy protections get, the research community will always need codes of conduct to protect privacy. “Researchers in many areas analyze datasets where someone—with enough determination—could be able to identify people,” he says. “I think [developing] rules for this is actually a more important point than the difficult task of creating special anonymized data sets.”
Other recent research has shown how call records can be used to follow soccer fans as they leave a match, or even to map poverty levels inside a country by analyzing airtime purchasing habits (see “Glimpses of a World Revealed by Cell-Phone Data”).