Geneticists Begin Tests of an Internet for DNA

Scientists are starting to open their DNA databases online, creating a network that could pave the way for gene analysis at a new scale.

Antonio Regalado

December 17, 2014

A coalition of geneticists and computer programmers calling itself the Global Alliance for Genomics and Health is developing protocols for exchanging DNA information across the Internet. The researchers hope their work could be as important to medical science as HTTP, the protocol created by Tim Berners-Lee in 1989, was to the Web.

One of the group’s first demonstration projects is a simple search engine that combs through the DNA letters of thousands of human genomes stored at nine locations, including Google’s server farms and the University of Leicester, in the U.K. According to the group, which includes key players in the Human Genome Project, the search engine is the start of a kind of Internet of DNA that may eventually link millions of genomes together.

The technologies being developed are application program interfaces, or APIs, that let different gene databases communicate. Pooling information could speed discoveries about what genes do and help doctors diagnose rare birth defects by matching children with suspected gene mutations to others who are known to have them.

The alliance was conceived two years ago at a meeting in New York of 50 scientists who were concerned that genome data was trapped in private databases, tied down by legal consent agreements with patients, limited by privacy rules, or jealously controlled by scientists to further their own scientific work. It styles itself after the World Wide Web Consortium, or W3C, a body that oversees standards for the Web.

“It’s creating the Internet language to exchange genetic information,” says David Haussler, scientific director of the genome institute at the University of California, Santa Cruz, who is one of the group’s leaders.

The group began releasing software this year. Its hope—as yet largely unrealized—is that any scientist will be able to ask questions about genome data possessed by other laboratories, without running afoul of technical barriers or privacy rules.

The researchers felt they had to act because the falling cost of decoding a genome—then about $10,000, and now already closer to $2,000—was producing a flood of data they were not prepared for. They feared ending up like U.S. hospitals, with electronic systems that are mostly balkanized and unable to communicate.

The way genomic data is siloed is becoming a problem because geneticists need access to ever larger populations. They use DNA information from as many as 100,000 volunteers to search for genes related to schizophrenia, diabetes, and other common diseases. Yet even these quantities of data are no longer seen as large enough to drive discovery. “You are going to need millions of genomes,” says David Altshuler, deputy director of the Broad Institute in Cambridge and chairman of the new organization. And no single database is that big.

The Global Alliance thinks the answer is a network that would open the various databases to limited digital searches by other scientists. Using that concept, says Heidi Rehm, a Harvard Medical School geneticist, the alliance is already working on linking together some of the world’s largest databases of information about the breast cancer genes BRCA1 and BRCA2, as well as nine currently isolated databases containing data about genes that cause rare childhood diseases.

In March, the group launched a test of whether scientific organizations would be willing to share data. A product called Beacon lets the owner of a database open it up for strictly limited searches.

“We are not trying to invent a technical feat; it’s breaking down this problem of people not sharing data,” says Marc Fiume, a computer science graduate student at the University of Toronto who built part of the interface. “This lets you probe, but without identifying anyone or violating patient privacy.”

So far, 15 databases are compatible with Beacon, which Fiume rates a reasonable success. Three are stores of public genomes that Google maintains a copy of, and one is at a software company called Curoverse in Boston (see “Google Wants to Store Your Genome”).
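
To make the idea of a "strictly limited" search concrete, here is a minimal sketch in Python. It assumes that a beacon answers only one kind of question, whether a given DNA letter has been seen at a given position, and replies with a bare yes or no. The article does not describe the actual interface, so the `Variant`, `Beacon`, and `query_all` names, the site names, and the coordinates are all hypothetical and are not the alliance's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    """One DNA letter observed at a position on a chromosome."""
    chromosome: str
    position: int
    allele: str


@dataclass
class Beacon:
    """Hypothetical stand-in for one database that answers only yes or no."""
    name: str
    _variants: set = field(default_factory=set, repr=False)

    def add_genome(self, variants):
        # Only the pooled set of variants is kept, with no per-person records.
        self._variants.update(variants)

    def has_allele(self, chromosome, position, allele):
        # The reply is a single boolean, never a genome or a patient identifier.
        return Variant(chromosome, position, allele.upper()) in self._variants


def query_all(beacons, chromosome, position, allele):
    """Ask every beacon the same question and collect the yes/no replies."""
    return {b.name: b.has_allele(chromosome, position, allele) for b in beacons}


# Made-up example data: two sites, one of which holds the variant.
site_a = Beacon("site_a")
site_a.add_genome([Variant("17", 41000000, "T")])
site_b = Beacon("site_b")

print(query_all([site_a, site_b], "17", 41000000, "t"))
```

Under these assumptions, a researcher learns which sites hold a variant of interest and can then approach the owners for fuller access, while no genome or patient record ever leaves a database.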

Haussler says a future protocol would offer access to progressively more data, but in a controlled way. Scientists would have to register, or even sign legal agreements. “If it’s ‘Give me the whole genome,’ you’d enter a contract for that,” he says.

One change the alliance is pushing is a new type of master consent form, the document that lays out volunteers’ rights when they hand over their genomes. The new consent is broader than most, giving permission for “controlled access” by “researchers around the world.” It promises that no researcher will identify a participant, although since DNA is unique, like a fingerprint, there would be no guarantees.

Like the W3C, the Global Alliance has “host institutions” that pay its bills. So far, they are the Broad Institute, the Wellcome Trust Sanger Institute in the U.K., and the Ontario Institute for Cancer Research, according to Altshuler, who declined to say how much money each had contributed.

John Wilbanks, chief commons officer of the nonprofit Sage Bionetworks, is working with the alliance and is also a former member of the W3C. He says the alliance has a harder task than the W3C did. “The Web existed long before the Web Consortium did. That is the big difference,” he says. “The Web got traction, and the consortium was created to manage it. They didn’t have to create the Web.”
