Genomic sequencing has rapidly gone from something possible only at the scale of a national research project to something that can be performed quickly and even cheaply (see “Is It Really Only $1,000 to Sequence a Genome?”). The amount of DNA being analyzed today is staggering—and so are the data storage needs.
Decoding all six billion bases, or letters, in the human genome isn’t a straightforward task. Gene-sequencing equipment reads relatively small pieces of DNA at a time and gradually assembles enough overlapping information to build a complete readout of the genome. That initial round of data capture produces vast amounts of raw information, the equivalent of millions of raw images, amounting to terabytes of data.
In the early days of sequencing, all this raw data was retained, but newer equipment discards the raw imaging data after processing and generates a compressed file that represents the genome in roughly 100 gigabytes. That file contains significant “oversampling” of the genome, often by a factor of at least 30, to ensure that there is enough reliable information, says Ilya Chorny, a market manager in the enterprise informatics unit of Illumina, a leading maker of gene-sequencing equipment.
A sort of stripped-down précis of about one gigabyte can be used in some cases, but that carries a lower degree of confidence in the accuracy. Michael Schatz, an associate professor of quantitative biology at Cold Spring Harbor Laboratory, says that 100 gigabytes is a good benchmark for projecting the storage requirements of any single human genome over the next decade.
Given the low cost of data storage, it might seem that the rapidly growing demand for it shouldn’t be an issue for genomic centers. Consider that a four-terabyte drive designed to be reliable enough for businesses can run as little as $130. Four terabytes is 4,000 gigabytes, or enough to hold 40 genomes, which means each one would use about $3 worth of storage capacity plus a bit extra for redundant offline backup.
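The per-genome figure comes from simple division. Here is a minimal sketch of that arithmetic, using the numbers quoted above and decimal units (1 terabyte = 1,000 gigabytes); the function name is purely illustrative, and the backup overhead mentioned above is left out.

```python
def drive_cost_per_genome(drive_price_usd, drive_capacity_gb, genome_size_gb):
    """Return (genomes that fit on one drive, storage cost per genome in USD)."""
    # Whole genomes only: a partially stored genome is not useful.
    genomes_per_drive = drive_capacity_gb // genome_size_gb
    cost_per_genome = drive_price_usd / genomes_per_drive
    return genomes_per_drive, cost_per_genome


# Figures from the article: a $130 drive, 4,000 GB capacity, 100 GB per genome.
genomes, cost = drive_cost_per_genome(130, 4_000, 100)
print(f"{genomes} genomes per drive, about ${cost:.2f} each")
# prints "40 genomes per drive, about $3.25 each"
```

The exact result, $3.25, is what the article rounds to "about $3."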
But many institutions now generate hundreds of terabytes of data a month and need to store it in a form that’s easily accessible worldwide. Illumina offers one such cloud-storage service, but there is increasing competition. In late 2014, Google Genomics began offering to store genomic data for 2.2 cents per gigabyte per month, which works out to about $26 a year for 100 gigabytes. Amazon Web Services also offers genomics services. It doesn’t publish a price list for them, but its standard storage fees would come to about $35 a year for 100 gigabytes.
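The cloud figures can be checked the same way. The sketch below recomputes the annual cost from the quoted Google Genomics rate, then back-calculates the monthly rate implied by the $35 Amazon figure. That implied rate is a derived number, not a published price, and both function names are illustrative.

```python
def annual_cost_usd(price_per_gb_month, size_gb, months=12):
    """Yearly storage cost for a file of size_gb at a flat per-gigabyte monthly rate."""
    return price_per_gb_month * size_gb * months


def implied_monthly_rate(annual_cost, size_gb, months=12):
    """Back out the per-gigabyte monthly rate from a quoted annual cost."""
    return annual_cost / (size_gb * months)


# 2.2 cents per gigabyte per month, 100 GB genome (figures from the article).
google_per_year = annual_cost_usd(0.022, 100)
print(f"${google_per_year:.2f} per year")  # prints "$26.40 per year"

# Derived from the article's $35-a-year estimate; an inference, not a list price.
amazon_rate = implied_monthly_rate(35, 100)
print(f"about {amazon_rate * 100:.1f} cents per gigabyte per month")
```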
The data demands will get even more intense. While the DNA present in every cell was originally seen as a consistent blueprint for the entire creature, “that’s certainly not true,” Schatz says. Genetic research has found a great deal of variation among different cells in the same person or other organism. That could mean that more than one instance of a person’s genome will need to be stored. This additional data might lend itself to substantial compression, as only the differences between DNA in various cells might need to be stored rather than the genomes as a whole. But compression increases the computational burden when the data needs to be accessed and analyzed; if storage is cheaper than the required computations, keeping the data available in a less efficient manner may make sense.
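To make the trade-off concrete, here is a toy sketch of difference-based storage: one full reference sequence is kept, and each additional cell's genome is recorded only as the positions where it differs. It is an illustration under simplifying assumptions (sequences of equal length, single-letter substitutions only), not the format any sequencing center or vendor actually uses. The function names and the short sequences are invented for the example.

```python
def encode_delta(reference, sample):
    """List (position, base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        # Insertions and deletions are out of scope for this sketch.
        raise ValueError("this sketch handles substitutions only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def decode_delta(reference, delta):
    """Rebuild the full sample sequence from the reference plus its differences."""
    bases = list(reference)
    for position, base in delta:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTACGT"   # made-up reference sequence
cell_a = "ACGTACGAACGT"      # made-up variant: one substitution at index 7

delta = encode_delta(reference, cell_a)
print(delta)  # prints "[(7, 'A')]"
assert decode_delta(reference, delta) == cell_a
```

The stored difference is far smaller than the sequence itself, but every read of `cell_a` now requires the `decode_delta` step. That extra step is the computational burden weighed against cheaper storage.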
Schatz and nine colleagues at the University of Illinois at Urbana-Champaign published a paper in July that attempted to get a handle on the upcoming storage requirements for sequencing. As the technology gets better and cheaper, they estimate, somewhere between 100 million and two billion human genomes will be stored by 2025. That growth would outpace the data requirements of other massive and growing storage users, including YouTube and astronomy as a whole.
Thanks to Nidhan Biswas for this question. If you have one, send it to firstname.lastname@example.org