
Shannon’s Mathematical Theory of Communication Applied to DNA Sequencing

Nobody knows which sequencing technology is fastest because there has never been a fair way to compare the rate at which they extract information from DNA. Until now.

One of the great unsung heroes of 20th-century science is Claude Shannon, an engineer at the famous Bell Laboratories during its heyday in the mid-20th century. Shannon’s most enduring contribution to science is information theory, which underpins all digital communication. 

In a famous paper dating from the late 1940s, Shannon set out the fundamental problem of communication: to reproduce, at one point in space, a message that has been created at another. The message is first encoded in some way, transmitted, and then decoded.

Shannon showed that a message can always be reproduced at another point in space with arbitrary precision, provided it is sent at a rate below a limit set by the noise in the channel. He went on to work out how much information could be sent in this way, a property known as the capacity of the information channel.
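To give a feel for what capacity means, here is a minimal Python sketch for the simplest textbook case: a binary symmetric channel, which flips each transmitted bit with some fixed probability. That channel is chosen here purely for illustration and is not the model used in the sequencing work discussed below; the function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(flip_prob: float) -> float:
    """Capacity, in bits per channel use, of a binary symmetric channel.

    Assumption for illustration: each bit is flipped independently
    with probability flip_prob.
    """
    return 1.0 - binary_entropy(flip_prob)


for p in (0.0, 0.01, 0.1, 0.5):
    print(f"flip probability {p:.2f}: capacity {bsc_capacity(p):.3f} bits per use")
```

A noiseless channel carries one full bit per use, while a channel that flips bits half the time carries nothing at all. In between, the capacity falls smoothly as the noise rises.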

Shannon’s ideas have been applied widely to all forms of information transmission with much success. One particularly interesting avenue has been the application of information theory to biology–the idea that life itself is the transmission of information from one generation to the next. 

That type of thinking is ongoing, revolutionary, and still in its early stages. There’s much to come. 

Today, we look at an interesting corollary in the area of biological information transmission. Abolfazl Motahari and pals at the University of California, Berkeley, use Shannon’s approach to examine how rapidly information can be extracted from DNA using the process of shotgun sequencing.

The problem here is to determine the sequence of nucleotides (A, G, C, and T) in a genome. That’s time-consuming because genomes tend to be long–for instance, the human genome consists of some 3 billion nucleotides or base pairs. This would take forever to sequence in series.

So the shotgun approach involves cutting the genome into random pieces, consisting of between 100 and 1,000 base pairs, and sequencing them in parallel. The information is then glued back together in silico by a so-called reassembly algorithm.

Of course, there’s no way of knowing how to reassemble the information from a single “read” of the genome. So in the shotgun approach, this process is repeated many times. Because each read divides up the genome in a different way, pieces inevitably overlap with segments from a previous run. These areas of overlap make it possible to reassemble the entire genome, like a jigsaw puzzle. 
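The jigsaw idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method used by any real assembler or by the paper: the genome is a random string, reads are error-free and all the same length, and the pieces are joined by a simple greedy rule that always merges the pair with the longest overlap. All function names and parameters here are hypothetical.

```python
import random


def shotgun_reads(genome, read_len, n_reads, rng):
    """Sample n_reads error-free reads of length read_len at random positions."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(seqs):
    """Remove duplicates and any sequence that lies wholly inside another."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]


def greedy_assemble(reads, min_overlap):
    """Repeatedly merge the two contigs with the largest suffix-prefix overlap."""
    contigs = drop_contained(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])


# Toy run: a random 120-letter "genome" and 60 reads of 25 letters each.
rng = random.Random(0)
genome = "".join(rng.choices("ACGT", k=120))
reads = shotgun_reads(genome, read_len=25, n_reads=60, rng=rng)
contigs = greedy_assemble(reads, min_overlap=8)
print(f"{len(contigs)} contig(s); genome fully recovered: {contigs == [genome]}")
```

Whether the toy run recovers the whole genome depends on whether the random reads happen to cover every position with enough overlap, which is exactly the kind of question a capacity result addresses. Greedy merging can also go wrong when the genome contains long repeats, one reason real assemblers are more elaborate.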

That smells like a classic problem of information theory, and indeed various people have thought about it in this way. However, Motahari and co go a step further by restating it more or less exactly as an analogue of Shannon’s famous approach.

They say the problem of genome sequencing is essentially one of reproducing a message written in DNA in a digital electronic format. In this approach, the original message is in DNA; it is encoded for transmission by the process of reading and then decoded by a reassembly algorithm to produce an electronic version.

What they prove is that there is a channel capacity that defines a maximum rate of information flow during the process of sequencing. “It gives the maximum number of DNA base pairs that can be resolved per read, by any assembly algorithm, without regard to computational limitations,” they say.

That is a significant result for anybody interested in sequencing genomes. An important question is how quickly any particular sequencing technology can do its job and whether it is faster or slower than other approaches. 

That’s not possible to work out at the moment because many of the algorithms used for assembly are designed for specific technologies and approaches to reading. Motahari and co say there are at least 20 different reassembly algorithms, for example. “This makes it difficult to compare different algorithms,” they say.

Consequently, nobody really knows which is quickest or even which has the potential to be quickest. 

The new work changes this. For the first time, it should be possible to work out how close a given sequencing technology gets to the theoretical limit.

That could well force a clear-out of dead wood from this area and stimulate a period of rapid innovation in sequencing technology.

Ref: arxiv.org/abs/1203.6233: Information Theory of DNA Sequencing
