Giant stockpiles of personal data, whether Web browsing logs, credit-card purchases, or the information shared through social networks, are becoming increasingly valuable assets for businesses. Such data can be analyzed to determine trends that guide business strategy, or sold to other businesses for a tidy profit. But as your personal data is analyzed and handed around, the risk increases that it could be traced back to you, presenting an unwelcome invasion of privacy.
A new mathematical technique developed at Cornell University could offer a way for large sets of personal data to be shared and analyzed while guaranteeing that no individual’s privacy will be compromised.
“We want to make it possible for Facebook or the U.S. Census Bureau to analyze sensitive data without leaking information about individuals,” says Michael Hay, an assistant professor at Colgate University, who created the technique while a research fellow at Cornell, with colleagues Johannes Gehrke, Edward Lui, and Rafael Pass. “We also have this other goal of utility; we want the analyst to learn something.”
Companies often attempt to mitigate the risk that the personal data they hold could be used to identify individuals, but these measures aren’t always effective. Both Netflix and AOL discovered this when they released supposedly “anonymized” data so that anyone could analyze it. Researchers showed that both data sets could be de-anonymized by cross-referencing them with data available elsewhere.
“In practice, people are using fairly ad-hoc techniques” to protect the privacy of users included in these data sets, says Hay. These techniques include stripping out names and Social Security numbers, or other data points. “People have crossed their fingers that they are providing true protection,” says Hay, who adds that data mavens at some government agencies fear lawsuits over failing to protect privacy properly. “I know in talking with other people at statistical agencies where they said we’re worried about being sued for privacy violations.”
In recent years, many researchers have worked on ways to mathematically guarantee privacy. However, the most promising approach, known as differential privacy, has proven challenging to implement, and it typically requires adding noise to a data set, which makes that data set less useful.
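To illustrate the kind of noise differential privacy involves, here is a minimal Python sketch of the Laplace mechanism, a standard way to answer a counting query under differential privacy. It is not taken from the Cornell work; the function name `noisy_count`, the `epsilon` value, and the example count of 100 are illustrative assumptions.

```python
import random


def noisy_count(true_count, epsilon, rng):
    """Return true_count plus Laplace noise with scale 1/epsilon.

    Assumes a counting query, where adding or removing one person
    changes the answer by at most 1 (sensitivity 1).
    """
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean
    # `scale` follows a Laplace distribution centred on zero.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [noisy_count(100, 0.5, rng) for _ in range(10000)]

# Averaged over many runs the answer is still about 100, but each
# single answer is off by roughly 1/epsilon = 2 on average.
mean_estimate = sum(samples) / len(samples)
mean_abs_error = sum(abs(s - 100) for s in samples) / len(samples)
```

A smaller `epsilon` gives stronger privacy but larger errors, which is the loss of usefulness described above.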
The Cornell group proposes an alternative approach called crowd-blending privacy. It involves limiting how a data set can be analyzed to ensure that any individual record is indistinguishable from a sizeable crowd of other records—and removing a record from the analysis if this cannot be guaranteed.
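The idea of dropping records that cannot hide in a crowd can be sketched in a few lines of Python. This is a simplified illustration, not the Cornell group's algorithm: the function `crowd_blended_histogram`, the crowd size `k`, and the choice to group ages by decade are all assumptions made for the example, and the formal definition is more subtle than a simple size threshold.

```python
from collections import Counter


def crowd_blended_histogram(records, bucket, k):
    """Count records per bucket, leaving out buckets with fewer than k records.

    Records that share a bucket are treated identically by the analysis,
    so each one blends in with the others in that bucket.
    """
    counts = Counter(bucket(r) for r in records)
    # A record in a small bucket has no crowd of size k to hide in,
    # so it is removed from the released result entirely.
    return {b: n for b, n in counts.items() if n >= k}


def age_decade(age):
    """Map an age to the start of its decade (hypothetical grouping)."""
    return (age // 10) * 10


ages = [23, 25, 27, 29, 31, 34, 35, 38, 39, 71]
released = crowd_blended_histogram(ages, age_decade, k=3)
print(released)  # {20: 4, 30: 5}
```

The lone 71-year-old is excluded because no other record shares that bucket, while the counts for the larger groups are released exactly, with no noise added.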
Noise does not need to be added to a data set, and when the data set being analyzed is a sample of a larger one, the group showed that crowd-blending comes close to matching the statistical strength of differential privacy. “The hope is that because crowd-blending is a less strict privacy standard it will be possible to write algorithms that will satisfy it,” says Hay, “and it could open up new uses for data.”
The new technique “provides an interesting and potentially very useful alternative privacy definition,” says Elaine Shi, an assistant professor at the University of Maryland, College Park, who is also researching ways to protect privacy in data sets. “In comparison with differential privacy, crowd-blending privacy can sometimes allow one to achieve much better utility, by introducing less or no noise.”
Shi adds that research into guaranteeing privacy should eventually make it possible to take responsibility for protecting users’ data out of the hands of software developers and their managers. “The underlying system architecture itself [would] enforce privacy—even when code supplied by the application developers may be untrusted,” she says. Shi’s research group is working on a cloud-computing system along those lines. It hosts sensitive personal data and allows access, but also carefully monitors the software that makes use of it.
Benjamin Fung, an associate professor at Concordia University, says crowd-blending is a useful idea, but believes that differential privacy may still prove feasible. His group worked with a Montreal transportation company to implement a version of differential privacy for a data set of geolocation traces. Fung suggests that research in this area needs to move on to implementation, so that crowd-blending and other approaches can be directly compared and eventually put into practice.
Hay agrees that it’s time for the discussion to move on to implementation. But he also points out that privacy protections won’t prevent other practices that some people may find distasteful. “You can satisfy constraints like this and still learn predictive correlations,” he says, which might result, for example, in auto insurance premiums being set based on information about a person that is seemingly unrelated to their driving. “As privacy guaranteeing techniques are adopted, it could be that other concerns emerge.”