Shannon’s Mathematical Theory of Communication Applied to DNA Sequencing

Nobody knows which sequencing technology is fastest because there has never been a fair way to compare the rate at which they extract information from DNA. Until now.

Emerging Technology from the arXiv

April 2, 2012

One of the great unsung heroes of 20th-century science is Claude Shannon, an engineer at the famous Bell Laboratories during its heyday in the mid-20th century. Shannon’s most enduring contribution to science is information theory, which underpins all digital communication.

In a famous paper dating from the late 1940s, Shannon set out the fundamental problem of communication: to reproduce, at one point in space, a message that has been created at another. The message is first encoded in some way, transmitted, and then decoded.

Shannon showed that a message can be reproduced at another point in space with arbitrarily small error, provided it is sent at a rate below a limit set by the noise in the channel. He went on to work out that limit, a property known as the capacity of the information channel.
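To make the idea of capacity concrete, here is a minimal sketch for the simplest textbook case, the binary symmetric channel, in which each transmitted bit is flipped independently with some probability. The channel model and the function names are assumptions chosen for illustration; they are not taken from the paper discussed below.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(flip_prob: float) -> float:
    """Capacity, in bits per transmitted bit, of a binary symmetric channel.

    flip_prob is the (assumed) probability that the channel flips a bit.
    """
    return 1.0 - binary_entropy(flip_prob)


if __name__ == "__main__":
    for flip_prob in (0.0, 0.01, 0.11, 0.5):
        print(f"flip probability {flip_prob:.2f}: "
              f"capacity {bsc_capacity(flip_prob):.3f} bits per bit sent")
```

A noiseless channel carries one full bit per bit sent, while a channel that flips bits half the time carries nothing at all. In between, capacity falls smoothly as the noise rises.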

Shannon’s ideas have been applied widely to all forms of information transmission with much success. One particularly interesting avenue has been the application of information theory to biology–the idea that life itself is the transmission of information from one generation to the next.

That type of thinking is ongoing, revolutionary, and still in its early stages. There’s much to come.

Today, we look at an interesting corollary in the area of biological information transmission. Abolfazl Motahari and pals at the University of California, Berkeley, use Shannon’s approach to examine how rapidly information can be extracted from DNA using the process of shotgun sequencing.

The problem here is to determine the sequence of nucleotides (A, G, C, and T) in a genome. That’s time-consuming because genomes tend to be long–for instance, the human genome consists of some 3 billion nucleotides or base pairs. This would take forever to sequence in series.

So the shotgun approach involves cutting the genome into random pieces, consisting of between 100 and 1,000 base pairs, and sequencing them in parallel. The information is then glued back together in silico by a so-called reassembly algorithm.

Of course, there’s no way of knowing how to reassemble the pieces from a single pass over the genome. So in the shotgun approach, this process is repeated many times. Because each pass divides up the genome in a different way, pieces inevitably overlap with segments from a previous run. These areas of overlap make it possible to reassemble the entire genome, like a jigsaw puzzle.
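The jigsaw idea can be sketched in a few lines of Python. This is a toy, assuming error-free pieces and a made-up 12-letter "genome" in which no three-letter pattern repeats; the greedy merging rule is a common textbook heuristic, not the method analysed in the paper or used by any particular assembler, and all function names are hypothetical.

```python
import random


def shotgun_reads(genome: str, read_len: int, n_reads: int,
                  rng: random.Random) -> list[str]:
    """Sample n_reads pieces of length read_len at random start positions."""
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


def overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of a that is also a prefix of b (0 if under min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop pieces that sit entirely inside another piece.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return frags  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)]
        frags.append(merged)
        # The merged fragment may now swallow some shorter pieces.
        frags = [f for f in frags if f == merged or f not in merged]


if __name__ == "__main__":
    # Four hand-picked overlapping pieces of the toy genome ACGTTGCAAGTC.
    pieces = ["GCAAGT", "ACGTTG", "CAAGTC", "GTTGCA"]
    print(greedy_assemble(pieces))  # prints ['ACGTTGCAAGTC']
```

The toy works because every overlap is unambiguous. Real genomes contain long repeated stretches, so a greedy merge like this one can join the wrong pieces, which is one reason reassembly is hard and why so many different algorithms exist.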

That smells like a classic problem of information theory, and indeed various people have thought about it in this way. However, Motahari and co go a step further by restating it more or less exactly as an analogue of Shannon’s famous approach.

They say the problem of genome sequencing is essentially one of reproducing a message written in DNA in a digital electronic format. In this approach, the original message is in DNA, it is encoded for transmission by the process of reading, and then it is decoded by a reassembly algorithm to produce an electronic version.

What they prove is that there is a channel capacity that defines a maximum rate of information flow during the process of sequencing. “It gives the maximum number of DNA base pairs that can be resolved per read, by any assembly algorithm, without regard to computational limitations,” they say.

That is a significant result for anybody interested in sequencing genomes. An important question is how quickly any particular sequencing technology can do its job and whether it is faster or slower than other approaches.

That’s not possible to work out at the moment because many of the algorithms used for assembly are designed for specific technologies and approaches to reading. Motahari and co say there are at least 20 different reassembly algorithms, for example. “This makes it difficult to compare different algorithms,” they say.

Consequently, nobody really knows which is quickest or even which has the potential to be quickest.

The new work changes this. For the first time, it should be possible to work out how close a given sequencing technology gets to the theoretical limit.

That could well force a clear-out of dead wood in this area and stimulate a period of rapid innovation in sequencing technology.

Ref: arxiv.org/abs/1203.6233: Information Theory of DNA Sequencing
