Study Highlights the Risk of Handing Over Your Genome

Researchers found they could tie people’s identities to supposedly anonymous genetic data by cross-referencing it with information available online.

Susan Young Rojahn

January 17, 2013

If you contribute your genome sequence anonymously to a scientific study, that data might still be linked back to you, according to a study published today in the journal Science. The researchers behind the study found they could deanonymize genomic data using only publicly available Internet information and some clever detective work.

The study points to rising issues concerning genetic privacy and the need for better legal protection against genetic discrimination, experts say, since such a technique could reveal a person’s propensity to a particular disease. The work also shows that study participants need to be better educated about the risks of joining genetic research efforts.

Open-access data sets of human genomic information are an important resource for researchers trying to uncover the genetic basis of human disease. The 1000 Genomes Project, for example, is a publicly available catalog of variation in humans that researchers can use to identify mutations that cause disease risk in certain populations (see “The Future of the Human Genome”). Researchers use this kind of open database much more often than controlled-access sources, the National Institutes of Health said in a response to today’s findings that was also published in Science.

“Our last intention is to push these resources behind some firewall,” says Yaniv Erlich, a geneticist at the Whitehead Institute for Biomedical Research and senior author of today’s study. “We are in favor of public data sharing, but we need to think about how it could be misused and describe that correctly to people.”

While the Genetic Information Nondiscrimination Act of 2008 offers people some protection against employers or health insurers discriminating against them based on their genetics, life insurers and disability insurers are not prevented from using such information in their decisions.

“We have no comprehensive genetic privacy law,” says Jeremy Gruber, a lawyer and president of the Council for Responsible Genetics. “People need to be much better informed of the lack of privacy protections we have for genetic information.”

In the long run, says Erlich, it is better for these potential breaches to be demonstrated by a friendly investigator than by someone who really wants to exploit the data. “That would really undermine the public trust,” he says.

This isn’t the first time privacy risks have been highlighted for public genome databases. Different groups have shown that with a second DNA sample, an individual’s genetic information could be pulled out of what was thought to be anonymous “pooled” genomic data or gene activity databases. But Erlich’s team used only knowledge of genetic markers and Internet detective work to identify nearly 50 people in public genomic data sets.

Erlich, a former computer security researcher, was once hired by banks and other businesses to test their computer systems. For the DNA sleuthing, Erlich and his team used free genealogical databases that link surnames with genetic markers, called short tandem repeats, on the Y chromosome. There is no known biological function for these repeats, but the length and number are commonly used in ancestry research because, like surnames, those patterns are typically passed from father to son.

Once the team found a link between the Y chromosome repeats in the genomic databases and potential surnames, they used other pieces of demographic information, such as date and place of birth, which are included in some of the genomic databases, and public records to identify donors.

Eric Green, director of the National Human Genome Research Institute, and other employees of the NIH acknowledge that Erlich’s study highlights vulnerabilities in these research projects. To mitigate future risks, they write in the response published in Science, the NIH has decided to “shift age information, which had been available for some of the participants on the repository’s public Web site, into controlled-access portions of the resource.”

In addition to recruiting people who think the societal and medical research benefits of participating in genomic research outweigh the risks, better legal protection is key, says George Church, a geneticist at Harvard Medical School and founder of the Personal Genome Project, an open-access database of genomic and health data. While there may be ways to make the data more secure, “for every lock there is going to be a countermeasure, and I think that’s a game that’s just not worth playing,” says Church. “Much better is coming up with a protocol where you don’t need any locks,” he says, which would include better legal protection and education for study participants.

Today’s findings emphasize the need for public representation in the oversight of data collection, says Wylie Burke, a clinical geneticist at the University of Washington in Seattle. “Information should be readily available to the public concerning the oversight procedures in place, the research purposes for which data are being used, the outcomes of data uses, and, of course, how any misuses of data have been handled,” she says. “Without this kind of approach, we could see increasing mistrust of the research process.”
