The genomic data generated by next-generation sequencing machines doesn’t amount to much more than alphabet soup if it’s not subjected to significant computational processing and statistical analysis. For the data to be useful, the trick is to turn those As, Ts, Gs, and Cs into a manageable description of disease risks and other genetic predispositions. That requires a lot of computational power and time, already a significant bottleneck for some genomic analysis companies (see “Bases to Bytes,” May/June 2012).
Several companies are looking to the cloud as a way to help analyze all that data. The idea is that researchers can send their data to a Web-hosted analysis service that will process raw reads into a genetic profile. But the files generated by sequencing machines are so massive that the mundane matter of uploading them to the cloud becomes a bottleneck in its own right. The strategy of Bina Technologies, a startup based in Redwood City, California, is to divide and conquer: give customers an in-house data-crunching machine that turns a mountain of raw sequence into easily shared genetic profiles. Those profiles can then be quickly uploaded to Bina Technologies’ cloud-hosted site for data management, sharing, and aggregation.
The company plans to sell its “Bina Box” preloaded with software that reduces the 300 gigabytes or so of raw data from a human genome to a few hundred megabytes of genetic information. The box then uploads the compressed data set to Bina’s cloud service for storage, sharing, and further analysis. The Bina Box does the initial heavy lifting and makes the data small enough to send to the cloud, says Narges Bani Asadi, Bina’s founder.
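The kind of reduction the Bina Box performs can be pictured with a toy example: rather than storing or shipping every raw read, keep only the positions where the sample disagrees with the reference genome. The sketch below is purely illustrative, with made-up reads and a ten-base stand-in reference; it assumes nothing about Bina’s actual algorithms.

```python
from collections import Counter

# Stand-in for the 3-billion-base human reference genome
reference = "ACGTACGTAC"

# Hypothetical raw reads, each covering the whole toy reference;
# three of the four carry an A where the reference has a T
reads = [
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTAC",
    "ACGAACGTAC",
]

# Keep only positions where the majority base differs from the reference:
# this tiny "profile" replaces the much larger pile of raw reads
variants = {}
for pos, ref_base in enumerate(reference):
    counts = Counter(read[pos] for read in reads)
    consensus, _ = counts.most_common(1)[0]
    if consensus != ref_base:
        variants[pos] = (ref_base, consensus)

print(variants)  # {3: ('T', 'A')}
```

In real pipelines the same idea is what makes the output so much smaller than the input: a genome’s worth of reads collapses to a list of differences from the reference.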
Bina Technologies says its system performs this initial processing of genomic data orders of magnitude faster than the tools made available by the Broad Institute, the joint MIT-Harvard genome center. What takes about a week using the Broad’s genome variation analysis pipeline on a high-end eight-core machine in Amazon’s cloud can be done in about two hours on a Bina Box, says Bani Asadi. The company expects to publish a full comparison with other analysis pipelines in the coming months.
Bina Technologies plans to work with a few genomics groups in a pilot phase for its system. One group in early conversations with the company is Foundation Medicine, a cancer genomics company in Cambridge, Massachusetts (see “Foundation Medicine: Personalizing Cancer Drugs,” March/April 2012). While the team responsible for preparing samples and generating raw sequence data has been able to scale up its processes to meet demand, the same is not true of the computational analysis, says Maureen Cronin, senior vice president of research collaborations at Foundation Medicine and an advisor to Bina Technologies. All the data streaming off Foundation Medicine’s sequencing machines, she says, has “created quite a computational problem.”
To be certain of the mutations they identify, says Cronin, Foundation Medicine sequences a patient’s genome at an average of 500X coverage; that is, every one of the three billion base pairs in the human genome is read about 500 times. This raw data, billions of short snippets of the genome, each a few dozen base pairs in length, must then be processed into longer chromosomal sequences. This “assembly” process is followed by a comparison of the individual’s genome against a reference standard, the product of the Human Genome Project. All this must happen before any clinical interpretation of a tumor or other genome can even begin. “It’s an incredibly computationally intensive process,” says Cronin.
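The scale Cronin describes can be sanity-checked with simple arithmetic. The read length below is an assumption based on the phrase “a few dozen base pairs,” not a figure from Foundation Medicine:

```python
# Back-of-the-envelope check of what 500X coverage of a human genome implies
genome_size = 3_000_000_000   # base pairs in the human genome
coverage = 500                # each base read ~500 times, on average
read_length = 50              # assumed: "a few dozen base pairs" per read

total_bases = genome_size * coverage    # 1.5 trillion base calls
num_reads = total_bases // read_length  # ~30 billion short reads

print(f"{total_bases:.1e} bases across {num_reads:.1e} reads")
# 1.5e+12 bases across 3.0e+10 reads
```

Tens of billions of reads per patient is what makes the assembly and comparison steps, and hence the computing, the bottleneck.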