Google Wants to Store Your Genome

For $25 a year, Google will keep a copy of any genome in the cloud.

Antonio Regalado

November 6, 2014

Google is approaching hospitals and universities with a new pitch. Have genomes? Store them with us.

The search giant’s first product for the DNA age is Google Genomics, a cloud computing service that it launched last March but that went mostly unnoticed amid a barrage of high-profile R&D announcements from Google, like one late last month about a far-fetched plan to battle cancer with nanoparticles (see “Can Google Use Nanoparticles to Search for Cancer?”).

Google Genomics could prove more significant than any of these moonshots. Connecting and comparing genomes by the thousands, and soon by the millions, is what’s going to propel medical discoveries for the next decade. The question of who will store the data is already a point of growing competition between Amazon, Google, IBM, and Microsoft.

Google began work on Google Genomics 18 months ago, meeting with scientists and building an interface, or API, that lets them move DNA data into its server farms and do experiments there using the same database technology that indexes the Web and tracks billions of Internet users.

“We saw biologists moving from studying one genome at a time to studying millions,” says David Glazer, the software engineer who led the effort and was previously head of platform engineering for Google+, the social network. “The opportunity is how to apply breakthroughs in data technology to help with this transition.”

Some scientists scoff that genome data remains too complex for Google to help with. But others see a big shift coming. When Atul Butte, a bioinformatics expert at Stanford, heard Google present its plans this year, he remarked that he now understood “how travel agents felt when they saw Expedia.”

The explosion of data is happening as labs adopt new, even faster equipment for decoding DNA. For instance, the Broad Institute in Cambridge, Massachusetts, said that during the month of October it decoded the equivalent of one human genome every 32 minutes. That translated to about 200 terabytes of raw data.
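As a rough check on those figures, the sketch below works out how many genomes a rate of one every 32 minutes implies for October, and how much raw data that is per genome. It assumes the sequencers ran continuously for all 31 days and that a terabyte is 1,000 gigabytes; neither assumption comes from Broad, and the variable names are only illustrative.

```python
# Figures quoted in the article.
MINUTES_PER_GENOME = 32   # one genome-equivalent decoded every 32 minutes
RAW_DATA_TB = 200         # raw data produced during October

# Assumptions (not from the article): round-the-clock operation for
# all 31 days of October, and decimal units (1 TB = 1,000 GB).
DAYS_IN_OCTOBER = 31
GB_PER_TB = 1000

minutes_in_october = DAYS_IN_OCTOBER * 24 * 60
genomes_decoded = minutes_in_october / MINUTES_PER_GENOME
gb_per_genome = RAW_DATA_TB * GB_PER_TB / genomes_decoded

print(f"Genome-equivalents in October: {genomes_decoded:.0f}")
print(f"Raw data per genome: about {gb_per_genome:.0f} GB")
```

Under these assumptions the month comes to roughly 1,400 genome-equivalents at a bit over 140 gigabytes each, the same order of magnitude as the roughly 100 gigabytes of raw data per genome cited for storage pricing.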

This flow of data is smaller than what is routinely handled by large Internet companies (over two months, Broad will produce the equivalent of what gets uploaded to YouTube in one day), but it exceeds anything biologists have dealt with. That’s now prompting a wide effort to store and access data at central locations, often commercial ones. The National Cancer Institute said last month that it would pay $19 million to move copies of the 2.6-petabyte Cancer Genome Atlas into the cloud. Copies of the data, from several thousand cancer patients, will reside both at Google Genomics and in Amazon’s data centers.

The idea is to create “cancer genome clouds” where scientists can share information and quickly run virtual experiments as easily as a Web search, says Sheila Reynolds, a research scientist at the Institute for Systems Biology in Seattle. “Not everyone has the ability to download a petabyte of data, or has the computing power to work on it,” she says.

Also speeding the move of DNA data to the cloud has been a yearlong price war between Google and Amazon. Google says it now charges about $25 a year to store a genome, and more to do computations on it. Scientific raw data representing a single person’s genome is about 100 gigabytes in size, although a polished version of a person’s genetic code is far smaller, less than a gigabyte. That would cost only about $0.25 a year.
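The pricing arithmetic can be sketched in a few lines. The sketch assumes storage cost scales linearly with file size, which is a simplification made here rather than Google’s stated pricing schedule, and the function names are hypothetical.

```python
# Figures quoted in the article.
RAW_GENOME_GB = 100         # raw data for one person's genome
RAW_COST_PER_YEAR = 25.0    # dollars per year to store that raw data


def rate_per_gb_year(cost=RAW_COST_PER_YEAR, size_gb=RAW_GENOME_GB):
    """Implied price per gigabyte per year.

    Assumption: a flat rate, i.e. cost is proportional to size.
    """
    return cost / size_gb


def yearly_storage_cost(size_gb):
    """Estimated yearly cost in dollars for a file of the given size."""
    return size_gb * rate_per_gb_year()


print(f"Implied rate: ${rate_per_gb_year():.2f} per GB per year")
print(f"Polished genome (1 GB upper bound): ${yearly_storage_cost(1):.2f} per year")
```

Because a polished genome is described as less than a gigabyte, the one-gigabyte figure is an upper bound on its yearly storage cost under this assumption.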

Cloud storage is giving a boost to startups like Tute Genomics, DNANexus, Seven Bridges, and NextCode Health. These companies build “browsers” that hospitals and scientists can use to explore genetic data. “Google or Amazon is a back end. They are saying, ‘Hey, you can build a genomics company in our cloud,’” says Deniz Kural, CEO of Seven Bridges, which stores genome data on behalf of 1,600 researchers in Amazon’s cloud.

The bigger point, he says, is that medicine will soon rely on a kind of global Internet-of-DNA which doctors will be able to search. “Our bird’s eye view is that if I were to get lung cancer in the future, doctors are going to sequence my genome and my tumor’s genome, and then query them against a database of 50 million other genomes,” he says. “The result will be ‘Hey, here’s the drug that will work best for you.’ ”

At Google, Glazer says he began working on Google Genomics as it became clear that biology was going to move from “artisanal to factory-scale data production.” He started by teaching himself genetics, taking an online class, Introduction to Biology, taught by Broad’s chief, Eric Lander. He also got his genome sequenced and put it on Google’s cloud.

Glazer wouldn’t say how large Google Genomics is or how many customers it has now, but at least 3,500 genomes from public projects are already stored on Google’s servers. He also says there’s no link, as of yet, between Google’s cloud and its more speculative efforts in health care, like the company Google started this year, called Calico, to investigate how to extend human lifespans. “What connects them is just a growing realization that technology can advance the state of the art in life sciences,” says Glazer.

Somalee Datta, a physicist who manages Stanford University’s largest computer cluster for genetics data, says that because of recent price cuts, it now costs about the same to store genomes with Google or Amazon as in her own data center. “Prices are finally becoming reasonable, and we think they will keep dropping,” she says.

Datta says some Stanford scientists have started using a Google database system, BigQuery, that Glazer’s team made compatible with genome data. It was developed to analyze large databases of spam, web documents, or consumer purchases. But it can also quickly perform the very large experiments that researchers want to try, comparing the genomes of thousands, or tens of thousands, of people. “Sometimes they want to do crazy things, and you need scale to do that,” says Datta. “It can handle the scale genetics can bring, so it’s the right technology for a new problem.”
