How Do Genome Sequencing Centers Store Such Huge Amounts of Data?

If genetic analysis is going to get as cheap as we expect, sequencing centers could become some of the world’s largest users of data storage.

Genomic sequencing has rapidly gone from something possible only at the scale of a national research project to something that can be performed quickly and even cheaply (see “Is It Really Only $1,000 to Sequence a Genome?”). The amount of DNA being analyzed today is staggering—and so are the data storage needs.

Gigabytes

Decoding all six billion bases, or letters, in the human genome isn’t a straightforward task. Gene-sequencing equipment reads relatively small pieces of DNA at a time and gradually assembles enough overlapping information to build a complete readout of the genome. That initial round of data capture produces vast amounts of raw information, the equivalent of millions of raw images, amounting to terabytes of data.

In the early days of sequencing, all this raw data was retained, but newer equipment dumps the raw imaging data after processing and generates a compressed file that represents the genome in roughly 100 gigabytes. That file contains significant “oversampling” of the genome—often by a factor of at least 30—to ensure that there is enough reliable information, says Ilya Chorny, a market manager in the enterprise informatics unit of Illumina, a leading maker of gene-sequencing equipment.

A sort of stripped-down précis of about one gigabyte can be used in some cases, but that carries a lower degree of confidence in the accuracy. Michael Schatz, an associate professor of quantitative biology at Cold Spring Harbor Laboratory, says that 100 gigabytes is a good benchmark for projecting the storage requirements of any single human genome over the next decade.

Given the low cost of data storage, it might seem like the rapidly growing need for it shouldn’t be an issue for genomic centers. Consider that a four-terabyte drive designed to be reliable enough for businesses can run as little as $130. Four terabytes is 4,000 gigabytes, or enough to hold 40 genomes, which means each one would use about $3 worth of storage capacity plus a bit extra for redundant offline backup.
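
The arithmetic behind that per-genome figure can be checked with a short sketch. It uses only the numbers quoted above ($130 for a four-terabyte drive, 100 gigabytes per genome); the function names are illustrative, and the backup cost mentioned above is left out.

```python
# Figures quoted in the article; backup storage is not included.
DRIVE_PRICE_USD = 130.0
DRIVE_CAPACITY_GB = 4_000  # four terabytes, in decimal gigabytes
GENOME_SIZE_GB = 100


def genomes_per_drive(capacity_gb: int, genome_gb: int) -> int:
    """Number of whole genome files that fit on one drive."""
    return capacity_gb // genome_gb


def cost_per_genome(price_usd: float, capacity_gb: int, genome_gb: int) -> float:
    """Drive price divided evenly across the genomes it can hold."""
    return price_usd / genomes_per_drive(capacity_gb, genome_gb)


count = genomes_per_drive(DRIVE_CAPACITY_GB, GENOME_SIZE_GB)
cost = cost_per_genome(DRIVE_PRICE_USD, DRIVE_CAPACITY_GB, GENOME_SIZE_GB)
print(f"{count} genomes per drive, ${cost:.2f} each")
```

With these inputs the drive holds 40 genomes at $3.25 apiece, which matches the rounded "about $3" above.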

But many institutions now generate hundreds of terabytes of data a month and need to store it in a form that’s easily accessible worldwide. Illumina offers one such cloud-storage service, but there is increasing competition. In late 2014, Google Genomics began offering to store genomic data for 2.2 cents per gigabyte per month, which works out to about $26 a year for 100 gigabytes. Amazon Web Services also offers genomics services. It doesn’t publish a price list for them; at its standard storage fees, 100 gigabytes would cost about $35 a year.
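
As a rough check on those cloud prices, the sketch below converts a per-gigabyte monthly rate into a yearly cost for one 100-gigabyte genome, and works backward from the $35 Amazon figure to the monthly rate it implies. The function names are illustrative, and the implied rate is derived from the article's number rather than taken from any provider's price list.

```python
GENOME_SIZE_GB = 100
MONTHS_PER_YEAR = 12


def annual_cost_usd(price_per_gb_month: float, size_gb: float) -> float:
    """Yearly cost of keeping size_gb stored at a flat monthly rate."""
    return price_per_gb_month * size_gb * MONTHS_PER_YEAR


def implied_monthly_rate(annual_usd: float, size_gb: float) -> float:
    """Per-gigabyte monthly rate implied by a quoted yearly cost."""
    return annual_usd / (size_gb * MONTHS_PER_YEAR)


google_annual = annual_cost_usd(0.022, GENOME_SIZE_GB)
amazon_rate = implied_monthly_rate(35.0, GENOME_SIZE_GB)

print(f"Google Genomics: ${google_annual:.2f} per genome per year")
print(f"Amazon (implied): {amazon_rate * 100:.2f} cents per GB per month")
```

The first line comes to $26.40 a year, and the $35 figure implies roughly 2.9 cents per gigabyte per month.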

Future shock

The data demands will get even more intense. While the DNA present in every cell was originally seen as a consistent blueprint for the entire creature, “that’s certainly not true,” Schatz says. Genetic research has found a great deal of variation among different cells in the same person or other organism. That could mean that more than one instance of a person’s genome will need to be stored. This additional data might lend itself to substantial compression, as only the differences between DNA in various cells might need to be stored rather than the genomes as a whole. But compression increases the computational burden when the data needs to be accessed and analyzed; if storage is cheaper than the required computations, keeping the data available in a less efficient manner may make sense.
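
The idea of storing only the differences between cells can be illustrated with a toy sketch. It is not how any real sequencing pipeline or file format works: it assumes every sequence has the same length as a shared reference and handles single-letter substitutions only, ignoring insertions and deletions. All names and sequences here are made up for illustration.

```python
from typing import Dict


def encode_differences(reference: str, sample: str) -> Dict[int, str]:
    """Record only the positions where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this toy sketch handles substitutions only")
    return {
        position: base
        for position, (ref_base, base) in enumerate(zip(reference, sample))
        if ref_base != base
    }


def reconstruct(reference: str, differences: Dict[int, str]) -> str:
    """Rebuild the full sequence; this is the extra work paid at read time."""
    bases = list(reference)
    for position, base in differences.items():
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"   # hypothetical shared sequence
cell_b = "ACGAACGTAC"      # one substitution
cell_c = "ACGTACCTAT"      # two substitutions

stored_b = encode_differences(reference, cell_b)
stored_c = encode_differences(reference, cell_c)
print(stored_b, stored_c)
```

Each cell is stored as a handful of positions instead of a full copy, but every read requires a pass through `reconstruct`, which is the storage-versus-computation trade-off described above.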

Schatz and nine colleagues at the University of Illinois at Urbana-Champaign published a paper in July that attempted to get a handle on the upcoming storage requirements for sequencing. As the technology gets better and cheaper, they estimate, somewhere between 100 million and two billion human genomes will be stored by 2025. This growth exceeds the pace of data requirements for other massive and growing storage users—including YouTube in particular, and astronomy as a whole.
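
For a sense of scale, the sketch below multiplies those genome counts by the 100-gigabyte benchmark cited earlier. This is a back-of-envelope calculation under that single assumption; the paper's own storage totals may rest on different per-genome figures.

```python
GB_PER_EXABYTE = 1_000_000_000  # decimal units: 1 EB = 10**9 GB


def total_exabytes(genome_count: int, gb_per_genome: float = 100) -> float:
    """Total storage, assuming every genome takes gb_per_genome gigabytes."""
    return genome_count * gb_per_genome / GB_PER_EXABYTE


low = total_exabytes(100_000_000)     # 100 million genomes
high = total_exabytes(2_000_000_000)  # two billion genomes
print(f"{low:.0f} to {high:.0f} exabytes")
```

Under that assumption, the range runs from 10 to 200 exabytes.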

Thanks to Nidhan Biswas for this question. If you have one, send it to readerquestions@technologyreview.com
