Researchers at Vanderbilt University have created an algorithm designed to protect the privacy of patients while maintaining researchers’ ability to analyze vast amounts of genetic and clinical data to find links between diseases and specific genes or to understand why patients can respond so differently to treatments.
Medical records hold all kinds of information about patients, from age and gender to family medical history and current diagnoses. The increasing availability of electronic medical records makes it easier to group patient files into huge databases where they can be accessed by researchers trying to find associations between genes and medical conditions, an important step on the road to personalized medicine. While the patient records in these databases are “anonymized,” or stripped of identifiers such as name and address, they still contain the numerical codes, known as diagnosis codes or ICD codes, that represent every condition a doctor has detected.
The problem is that it’s not all that difficult to follow a specific set of codes backward and identify a person, says Bradley Malin, an assistant professor of biomedical informatics at Vanderbilt University and one of the algorithm’s coauthors. In a paper published online today in the Proceedings of the National Academy of Sciences, Malin and his colleagues found that they could identify more than 96 percent of a group of patients based solely on their particular sets of diagnosis codes. “When people are asked about privacy priorities, their health data is always right up there with information about their finances,” says Malin, and for good reason. In 2000, computer science researcher Latanya Sweeney cross-referenced voter-registration records with a limited amount of public record information from the Group Insurance Commission (birth date, gender, and zip code) to identify the full medical records of former Massachusetts governor William Weld and his family. In the wrong hands, medical information could lead to blackmail or employment discrimination, or, less critical but still immensely annoying, increases in medical spam. In addition, the hospitals where data were compromised could be liable for negligence, says Malin.
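To see why a set of diagnosis codes can act as a fingerprint, consider a minimal Python sketch that measures what share of patients in a database have a combination of codes no one else shares. This is an illustration only, not the method used in the paper; the function name `fraction_unique`, the patient IDs, and the ICD-9-style codes are all hypothetical.

```python
from collections import Counter

def fraction_unique(records):
    """Return the share of patients whose set of diagnosis codes is unique.

    `records` maps a patient ID to a list of diagnosis codes (hypothetical layout).
    """
    # A frozenset is hashable and ignores the order in which codes were recorded.
    signatures = {pid: frozenset(codes) for pid, codes in records.items()}
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures) if signatures else 0.0

# Toy data; the codes are illustrative only.
records = {
    "p1": ["733.01", "250.00"],
    "p2": ["250.00", "733.01"],  # same set as p1, so neither is unique
    "p3": ["733.01", "401.9"],
    "p4": ["493.90"],
}
print(fraction_unique(records))  # 0.5
```

Anyone who knows a unique patient's diagnoses from another source could, in principle, pick that record out of the "anonymized" database.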
To solve this problem, the Vanderbilt team designed an algorithm that searches a database for combinations of diagnosis codes that distinguish a patient. It then substitutes a more general version of the codes (for instance, postmenopausal osteoporosis could become osteoporosis) to ensure that each patient’s altered record is indistinguishable from the records of a certain number of other patients. Researchers could then access this parallel, de-identified database for gene-association studies.
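The general idea of replacing distinctive codes with broader ones can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Vanderbilt team's algorithm: it assumes that dropping the digits after the decimal point yields a broader category, it generalizes in a single pass, and the names `generalize`, `anonymize`, and the threshold `k` are hypothetical.

```python
from collections import Counter

def generalize(code):
    """Assumed rule: drop everything after the dot to get a broader category."""
    return code.split(".")[0]

def anonymize(records, k):
    """Generalize records whose code set is shared by fewer than k patients.

    Returns the altered records and the IDs that are still too distinctive
    after one round of generalization.
    """
    sigs = {pid: frozenset(codes) for pid, codes in records.items()}
    counts = Counter(sigs.values())
    # Only records that are too distinctive lose detail.
    for pid, sig in list(sigs.items()):
        if counts[sig] < k:
            sigs[pid] = frozenset(generalize(c) for c in sig)
    # Recount to find records that still match fewer than k patients.
    counts = Counter(sigs.values())
    unresolved = {pid for pid, sig in sigs.items() if counts[sig] < k}
    return sigs, unresolved

# Toy data; the ICD-9-style codes are illustrative only.
records = {
    "p1": ["733.01", "250.00"],
    "p2": ["733.00", "250.00"],
    "p3": ["733.09", "250.02"],
    "p4": ["493.90"],
}
anon, unresolved = anonymize(records, k=3)
print(sorted(anon["p1"]))  # ['250', '733']
print(unresolved)          # {'p4'}
```

In this sketch, raising `k` forces more records to be generalized, which trades clinical detail for privacy. A record that remains distinctive, like `p4` here, would need further generalization or removal before release.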
To test their algorithm, the researchers applied it to the records of 2,762 patients, then went back and tried to reconnect the study participants to their diagnosis codes. They were unable to do so. The algorithm also allows researchers to explicitly balance the level of anonymization against the needs of their research. Ben Reis, an assistant professor at Harvard Medical School who studies personalized, predictive medical systems, says this is an important benefit of the Vanderbilt system.