Keeping Medical Data Private

Algorithm protects patients’ personal information while preserving the data’s utility in large-scale medical studies.

Katharine Gammon

April 12, 2010

Researchers at Vanderbilt University have created an algorithm designed to protect the privacy of patients while maintaining researchers’ ability to analyze vast amounts of genetic and clinical data to find links between diseases and specific genes or to understand why patients can respond so differently to treatments.

Medical records hold all kinds of information about patients, from age and gender to family medical history and current diagnoses. The increasing availability of electronic medical records makes it easier to group patient files into huge databases where they can be accessed by researchers trying to find associations between genes and medical conditions–an important step on the road to personalized medicine. While the patient records in these databases are “anonymized,” or stripped of identifiers such as name and address, they still contain the numerical codes, known as diagnosis codes or ICD codes, that represent every condition a doctor has detected.

The problem is, it’s not all that difficult to follow a specific set of codes backward and identify a person, says Bradley Malin, an assistant professor of biomedical informatics at Vanderbilt University and one of the algorithm’s coauthors. In a paper published online today in the Proceedings of the National Academy of Sciences, Malin and his colleagues found that they could identify more than 96 percent of a group of patients based solely on their particular sets of diagnosis codes. “When people are asked about privacy priorities, their health data is always right up there with information about their finances,” says Malin–and for good reason. In 2000, computer science researcher Latanya Sweeney cross-referenced voter-registration records with a limited amount of public record information from the Group Insurance Commission (birth date, gender, and zip code) to identify the full medical records of former Massachusetts governor William Weld and his family. In the wrong hands, medical information could lead to blackmail or employment discrimination, or, less critical but still immensely annoying, increases in medical spam. In addition, the hospitals where data were compromised could be liable for negligence, says Malin.
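The uniqueness problem behind that 96 percent figure can be illustrated with a small sketch. It is not the Vanderbilt team's code: the patient IDs, the ICD-9-style code strings, and the function name `fraction_unique` are all made up for illustration. It simply counts how many patients in a toy database have a set of diagnosis codes that no one else shares.

```python
from collections import Counter

def fraction_unique(records):
    """Share of patients whose exact set of codes no other patient has.

    `records` maps a (hypothetical) patient ID to a set of diagnosis codes.
    """
    if not records:
        return 0.0
    signatures = [frozenset(codes) for codes in records.values()]
    counts = Counter(signatures)
    unique = sum(1 for sig in signatures if counts[sig] == 1)
    return unique / len(signatures)

# Toy data, invented for this example; the codes are illustrative only.
toy_records = {
    "p1": {"733.01", "493.90"},
    "p2": {"733.01", "493.90"},  # same code set as p1, so neither stands out
    "p3": {"733.00"},            # unique code set
    "p4": {"250.00", "401.9"},   # unique code set
}

print(fraction_unique(toy_records))  # 0.5
```

A patient with a unique code set is the kind of record that someone holding the same codes from another source could, in principle, trace back to a name.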

To solve this problem, the Vanderbilt team designed an algorithm that searches a database for combinations of diagnosis codes that distinguish a patient. It then substitutes a more general version of the codes–for instance, postmenopausal osteoporosis could become osteoporosis–to ensure each patient’s altered record is indistinguishable from those of a certain number of other patients. Researchers could then access this parallel, de-identified database for gene-association studies.
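The general idea of swapping specific codes for broader ones can be sketched in a few lines. This is a simplified illustration, not the published algorithm, which the article does not describe in detail. It assumes that a code can be generalized by dropping everything after the dot, and the parameter `k` stands in for the "certain number" of look-alike records; the function names and toy data are hypothetical.

```python
from collections import Counter

def category(code):
    """Assumed generalization rule: keep only the part before the dot."""
    return code.split(".")[0]

def exposed_patients(records, k):
    """Patients whose code set is shared by fewer than k records (including their own)."""
    counts = Counter(frozenset(codes) for codes in records.values())
    return {pid for pid, codes in records.items() if counts[frozenset(codes)] < k}

def generalize_exposed(records, k):
    """Generalize the code families that appear in exposed records.

    Returns the altered records and the patients who are still exposed.
    """
    exposed = exposed_patients(records, k)
    risky = {category(c) for pid in exposed for c in records[pid]}
    generalized = {
        pid: {category(c) if category(c) in risky else c for c in codes}
        for pid, codes in records.items()
    }
    return generalized, exposed_patients(generalized, k)

# Toy data, invented for this example.
toy_records = {
    "p1": {"733.01", "493.90"},
    "p2": {"733.00", "493.90"},
    "p3": {"250.00"},
    "p4": {"250.00"},
}

altered, still_exposed = generalize_exposed(toy_records, k=2)
print(altered["p1"] == altered["p2"])  # True
print(still_exposed)                   # set()
```

In this toy run, p1 and p2 become indistinguishable from each other once their codes are generalized, while p3 and p4 keep their full detail because they already matched. A single pass like this will not always succeed; any patients it cannot protect are returned in the second value.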

To test their algorithm, the researchers applied it to the records of 2,762 patients, then went back and tried to reconnect the study participants to their diagnosis codes. They were unable to do so. The algorithm also allows researchers to explicitly tune the level of anonymization to the needs of their research. Ben Reis, an assistant professor at Harvard Medical School who studies personalized, predictive medical systems, says this is an important benefit of the Vanderbilt system.

There is an inherent tension between using medical records for legitimate clinical research and protecting patient privacy. “The problem is, stuff that’s considered anonymous really isn’t,” says Michael Swiernik, director of medical informatics at the University of California, Los Angeles. “It’s going to take a lot of different creative approaches to protect people, and this algorithm is one tool in that box.”

The new approach has its limitations. It works best, say the researchers, when a study starts out with a specific hypothesis or goal–say, to study the prevalence of asthma in teenagers with allergies. Reusing the same de-identified data later to examine associations between two arbitrary health issues would be more difficult.

The researchers want to combine their clinical-code-protecting algorithm with other security mechanisms already in place, like protections for demographic information, to keep patient data as safe as possible. They also want to extend their work to data from outside Vanderbilt, according to Grigorios Loukides, the study’s lead author.

The future of science relies on more subtle ways of extracting useful information from existing data. Methods that allow researchers to be more nuanced in how they anonymize data “enable us to maximize the scientific benefit we get from population data while controlling the risks to privacy,” according to Isaac Kohane, director of the Boston Children’s Hospital Informatics Program. It’s all about sharing, says study author Malin. “Generating data is expensive, and it’s both good science and good etiquette to reuse data. The challenge is to do it while protecting people.”
