Biophysicists Discover Four New Rules of DNA ‘Grammar’

For 60 years, biologists have known of only two grammar-like rules that govern the language of DNA. Now they’ve found four more.

Emerging Technology from the arXiv

December 9, 2011

The Austrian biochemist Erwin Chargaff is famous for the two rules he discovered that now bear his name. At the time of this discovery, in 1950, the biggest problem in biology was understanding the structure of DNA. Chargaff’s rules turned out to be an important clue in this puzzle.

Biologists had long known that DNA was built out of four molecules: adenine, guanine, thymine and cytosine. They assumed that these molecules occurred in equal quantity and dismissed any measurements that hinted otherwise as experimental errors.

Chargaff showed through careful measurement that this assumption was wrong. He found that the amount of adenine equalled that of thymine and the amount of guanine equalled that of cytosine, but that the two pairs were not equal to each other. In human DNA, for example, the rough figures are A=T=30% and G=C=20%.

Chargaff’s first parity rule, as this is now called, was an important clue that James Watson and Francis Crick used to develop their base-pair model for the double helix structure. Biologists now know that since A binds with T and G binds with C to form a double helix, this rule holds for all double-stranded DNA.
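To see why base pairing forces the rule, here is a minimal Python sketch. The function names and the toy strand are illustrative assumptions, not anything taken from Chargaff's measurements. It builds the partner strand by complementing each base, then tallies the bases across both strands.

```python
from collections import Counter

# Watson-Crick pairing: A binds with T, G binds with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand: str) -> Counter:
    """Count each base across both strands of the double helix."""
    return Counter(strand) + Counter(reverse_complement(strand))


# A deliberately lopsided toy strand (hypothetical data): no T at all.
print(duplex_base_counts("AAAAGGC"))
```

The toy strand on its own has four A's and no T's, yet the tally for the duplex always comes out with A equal to T and G equal to C, whatever strand you start from.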

Chargaff went on to discover that an approximate version of his rule also holds for most (but not all) single-stranded DNA. That’s much more of a puzzle and biologists still aren’t quite sure why it is true.

Chargaff’s rules are important because they point to a kind of “grammar of biology”, a set of hidden rules that govern the structure of DNA. This grammar ought to reveal itself as patterns in DNA that are invariant across all species.

But in the 60 years since Chargaff discovered his invariant patterns, no others have emerged. Until now.

Today, Michel Yamagishi at the Applied Bioinformatics Laboratory in Brazil and Roberto Herai at Unicamp in São Paulo say they’ve discovered several new patterns that significantly broaden the grammar of DNA.

Their approach is straightforward. These guys use set theory to show that Chargaff’s existing rules imply the existence of other, higher-order patterns.

Here’s how. One way to think about the patterns in DNA is to divide up a DNA sequence into words of a specific length, k. Chargaff’s rules apply to words where k=1, that is, to single nucleotides.

But what of words with k=2 (e.g. AA, AC, AG, AT and so on) or k=3 (AAA, AAG, AAC, AAT and so on)? Biochemists call these words oligonucleotides. Set theory implies that the entire sets of these k-words must also obey certain fractal-like patterns.
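Counting k-words is simple enough to sketch in a few lines of Python. One assumption to flag: the code below slides a window along the sequence one base at a time, so the words overlap. That is a common convention, but the description above does not say how the sequence is divided up. The function name is hypothetical.

```python
from collections import Counter


def kword_frequencies(sequence: str, k: int) -> dict[str, float]:
    """Relative frequency of every overlapping k-word in the sequence."""
    total = len(sequence) - k + 1  # number of window positions
    if k < 1 or total < 1:
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


# k=1 gives single-nucleotide frequencies; k=2 gives dinucleotides.
print(kword_frequencies("AACGTT", 1))
print(kword_frequencies("AACGTT", 2))
```

With k=1 this reduces to the base counts that Chargaff measured; larger k gives the oligonucleotide frequencies that the new patterns are about.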

Yamagishi and Herai distil them into four equations.

Of course, it’s only possible to see these patterns in huge DNA datasets. Sure enough, Yamagishi and Herai have number-crunched the DNA sequences of 32 species looking for these new fractal patterns. And they’ve found them.

They say the patterns show up with great precision in 30 of these species, including humans, E. coli and the plant Arabidopsis. Only human immunodeficiency virus (HIV) and Xylella fastidiosa 9a5c, a bug that attacks peaches, do not conform.

“These new rules show for the first time that oligonucleotide frequencies do have invariant properties across a large set of genomes,” they say.

That could turn out to be extremely useful for assessing the performance of new technologies for sequencing entire genomes at high speed.

One problem with these techniques is knowing how accurately they work. Yamagishi and Herai suggest that a simple test would be to check whether the newly sequenced genomes contain these invariant patterns. If not, then that’s a sign the technology may be introducing some kind of bias.

This is a bit like a checksum test for spotting accidental errors in blocks of data, and a neat piece of science to boot.
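The four equations are not reproduced here, so the Python sketch below does not implement them. As a stand-in, it checks a simpler, well-known invariant of the same family, the k-word form of Chargaff's rule for single strands: in a long single strand, a word and its reverse complement tend to appear about equally often. The function name parity_deviation and the choice to report the largest gap are assumptions made for illustration, and the code expects upper-case A, C, G and T.

```python
from collections import Counter

# Translation table for complementing a strand of upper-case bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Complement each base, then reverse the reading direction."""
    return word.translate(COMPLEMENT)[::-1]


def parity_deviation(sequence: str, k: int) -> float:
    """Largest frequency gap between any k-word and its reverse complement."""
    total = len(sequence) - k + 1  # number of overlapping windows
    if k < 1 or total < 1:
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(sequence[i:i + k] for i in range(total))
    # A Counter returns 0 for words that never occur, so absent
    # reverse complements are handled without special cases.
    return max(
        abs(counts[word] - counts[reverse_complement(word)]) / total
        for word in list(counts)
    )


# Hypothetical toy inputs; a real check would use a whole genome.
print(parity_deviation("ACGT", 2))
print(parity_deviation("AAAA", 1))
```

A score near zero on a long genome is what the rule predicts. How large a score should count as suspicious is a judgment call that this sketch leaves open.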

Ref: arxiv.org/abs/1112.1528: Chargaff’s “Grammar of Biology”: New Fractal-like Rules
