Chinese, African Genomes Sequenced

By validating an emerging technology, two new genomic studies offer hope for the fight against disease.

Emily Singer

November 6, 2008

A male Yoruba from Nigeria and a Han Chinese man joined genetics luminaries James Watson and Craig Venter on Wednesday as the only people to have their genomes sequenced and made publicly available. The two anonymous genomes serve as proof that new sequencing technologies, which are orders of magnitude cheaper than standard methods, are capable of accurately reading the sequence of a complete human genome. That means that scientists will be able to sequence thousands of people, which they hope will finally enable a coherent understanding of the genomic basis of disease.

**Reading DNA:** Densely packing DNA fragments onto this credit-card-size chip from Illumina, called a flow cell, allows high-throughput sequencing. About 50 million clusters of DNA, each containing approximately 1,000 copies of the same fragment, can fit on a flow cell. It currently takes about 40 flow cells to accurately sequence a human genome.

“This brings the time it takes to sequence a human genome from years to months,” says Samuel Levy, director of human genetics at the Craig Venter Institute, in Rockville, MD, who was not involved in the research. “That’s a huge technological advance. It gives us the ability to do the kinds of studies we want to do to associate genetic variations with human traits.”

Over the past decade, the cost of sequencing has dropped dramatically. While the reference sequence generated during the Human Genome Project cost $300 million, Watson’s genome, released last year and sequenced using a technology developed by 454 Life Sciences, in Branford, CT, cost about $1 million to $2 million. The Yoruba genome cost an estimated $250,000 and took only two months to complete, using technology from Illumina, a genetics technology company headquartered in San Diego.

New sequencing technologies boost speed and reduce cost by simultaneously sequencing hundreds of thousands of pieces of DNA. For technical reasons, this massive parallelism reduces the number of base pairs–the DNA “letters”–that can be read from each piece. Standard sequencing methods can read 400 to 800 base pairs, but Illumina’s technology can read only 35 to 50. That makes it harder to assemble a complete sequence, which requires computationally sewing the overlapping pieces together.
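To make the idea of computationally sewing overlapping pieces together concrete, here is a toy sketch in Python. It is not the method used in either study: the function names, the greedy strategy, and the tiny example reads are all assumptions chosen for illustration. It also ignores sequencing errors and repeated stretches of DNA, which are part of what makes short reads hard to assemble in practice.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the two reads that share the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps; stop merging
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)  # longest assembled piece


# Three hypothetical 8-letter reads taken from one 14-letter sequence.
print(greedy_assemble(["ATGCGTAC", "CGTACGTT", "ACGTTAGC"]))  # ATGCGTACGTTAGC
```

The shorter the reads, the shorter the possible overlaps, so a repeated stretch of DNA is more likely to make two unrelated pieces look like neighbors.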

Because of these short read lengths, it has been unclear how accurately technology from Illumina and other companies could sequence a human genome. In the new studies, published today in Nature, researchers from Illumina and from the Beijing Genomics Institute, in China, show that by sequencing their subjects’ genomes roughly 40 times each, they were able to read 99.9 percent of the sequence in the reference genome. The greater number of sequencing passes–standard sequencing requires only about 6 to 10 passes–is necessary to compensate for shorter read lengths. But even with the extra passes, the new technology is much cheaper.
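The number of passes can be roughly estimated from figures quoted in this article: about 50 million clusters per flow cell, about 40 flow cells per genome, and reads of 35 to 50 base pairs. The Python sketch below is back-of-the-envelope arithmetic only. It assumes a genome of about 3 billion base pairs (a figure not given in the article) and exactly one usable read per cluster, so it shows the order of magnitude rather than the studies' actual numbers.

```python
def estimated_passes(clusters_per_flow_cell: int,
                     flow_cells: int,
                     read_length: int,
                     genome_size: int = 3_000_000_000) -> float:
    """Average number of times each base is read (total bases / genome size)."""
    total_bases = clusters_per_flow_cell * flow_cells * read_length
    return total_bases / genome_size


# Article figures: ~50 million clusters per flow cell, ~40 flow cells.
for read_length in (35, 50):
    passes = estimated_passes(50_000_000, 40, read_length)
    print(f"{read_length}-base reads: about {passes:.0f} passes")
```

Under these assumptions the estimate lands in the tens of passes, the same order of magnitude as the roughly 40 passes reported.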

The scientists were able to verify the accuracy of their sequences by comparing them with previous genetic analyses of the same genomes. The Yoruba DNA sequenced by David Bentley and his colleagues at Illumina had been used in previous studies that looked for single-nucleotide polymorphisms (SNPs), or genetic variations of a single letter at a time, spread out through the whole genome. Jun Wang and colleagues at the Beijing Genomics Institute, who sequenced the Chinese genome, checked their results against those from a microarray, which is designed to detect thousands of common SNPs.

The two new sequences don’t reveal any genomic surprises. Researchers found approximately four million SNPs in the Yoruba genome, about 26 percent of which had not been previously identified. The Yoruba genome displayed a higher level of genetic diversity than previously sequenced single genomes, but earlier analysis of African DNA had predicted as much. In the Chinese genome, by contrast, about 13.6 percent of the SNPs had not been previously identified.

Scientists hope that the ability to identify novel SNPs will be a boon in the hunt for the genomic basis of disease. Most large genomic studies to date have focused on common genetic variations–those with a frequency of at least 5 percent–because they were the easiest to find. But research suggests that these variations account for only a fraction of the genetic contribution to common diseases. The ability to sequence many human genomes will allow scientists to find more rare variants and to characterize the potentially large role that they play in human health.

Such studies are already under way. The Yoruba genome is part of an international collaboration known as the 1,000 Genomes project, which will serve as a technological test bed for high-volume human sequencing. “You couldn’t do 1,000 genomes with old technologies, but the new technologies are making it possible,” says Lisa Brooks, director of the genetic-variation program at the National Human Genome Research Institute, in Bethesda, MD. Scientists involved in the project aim to catalogue all human variations that appear at about 0.1 percent frequency.

Illumina is not alone in its quest to cheaply sequence human genomes. Applied Biosystems, the company that supplied many of the sequencing machines for the Human Genome Project, has also sequenced the Yoruba genome and is likely to publish its results soon. Two startups, Pacific Biosciences and Complete Genomics, are also hot on the trail. Complete Genomics, for example, promises a $5,000 genome by next year. The company’s scientists have not yet published their results in peer-reviewed journals, however, so the completeness and accuracy of their method have yet to be independently validated. “With this and other data from the 1,000 Genomes project, we will be in good position to properly calibrate these different technologies,” says Richard Gibbs, director of the Human Genome Sequencing Center at Baylor College of Medicine, in Houston, TX.

The two new genomes are also the first non-Caucasian ones to be added to the public database. “They provide a stepping stone to understanding genetic differences between ethnicities,” says Levy, who wrote a commentary accompanying the publication of the two papers.

In the same issue of Nature, scientists from Washington University School of Medicine, in St. Louis, describe using Illumina’s technology to sequence the first complete cancer genome. They found eight previously unidentified mutations, which may shed light on the disease.

In the Illumina sequencing approach, DNA is fragmented into small pieces and molecularly attached to a specially designed slide known as a flow cell. About 50 million fragments fit on a single cell. Each fragment is copied 1,000 times while still stuck to the flow cell. Fluorescently labeled bases, representing the four letters that make up DNA and colored red, green, blue, and yellow, are then added to the cell. The base that corresponds to the letter at the first position in a fragment of DNA will attach to that fragment. A camera then snaps a picture of the fluorescent bases at each of the 50 million locations on the flow cell. The base is then clipped off, and the cycle is repeated for each letter of the DNA fragment. The resulting images are computationally stitched together to generate a sequence.
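The read-one-letter, photograph, repeat cycle described above can be mimicked with a short Python simulation. This is a simplified sketch, not Illumina's software: the pairing of colors with letters is an arbitrary assumption (the article names the four colors but not which base gets which), each "image" is just a list with one color per cluster, and the function names are made up for illustration.

```python
# Assumed color-to-letter pairing, for illustration only.
DYE = {"A": "red", "C": "green", "G": "blue", "T": "yellow"}
BASE = {color: base for base, color in DYE.items()}


def image_cycles(fragments: list[str]) -> list[list[str]]:
    """One 'image' per cycle: the color seen at every cluster in that cycle."""
    n_cycles = min(len(f) for f in fragments)
    return [[DYE[f[cycle]] for f in fragments] for cycle in range(n_cycles)]


def call_reads(images: list[list[str]]) -> list[str]:
    """Stitch the per-cycle images back into one read per cluster."""
    n_clusters = len(images[0])
    return ["".join(BASE[img[c]] for img in images) for c in range(n_clusters)]


# Three hypothetical clusters; a real flow cell holds about 50 million.
fragments = ["ACGT", "TTGA", "GGCC"]
images = image_cycles(fragments)
print(images[0])           # colors photographed in the first cycle
print(call_reads(images))  # reads recovered from the stack of images
```

Each cycle yields one letter for every cluster at once, which is why the number of letters read per fragment equals the number of cycles run.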
