Interpreting the Genome

New technologies will soon make it possible to sequence thousands of human genomes. Now comes the hard part: understanding all the data.

Emily Singer

December 22, 2008

The 12 prototypes look like prefabricated children’s forts–boxes the size of freezers, faced with bright red plastic and grouped in twos and threes on a concrete floor at Pacific Biosciences, a startup in Menlo Park, CA. But the simple exterior of the machines belies the complexity within. Each box houses a small chip packed with thousands of strands of DNA from bacteria or viruses, each strand in a nano-sized well. An enzyme stuck to the bottom of each well speedily builds a corresponding strand, stringing together the bases, or chemical subunits of DNA, that pair properly with those of the original. Each of the four types of bases, represented by the letters A, T, C, and G, is labeled with a different fluorescent marker, which is activated by the reaction that attaches a new base to the strand. Because the machine tracks the reactions as they happen, it can churn out reams of raw data on the sequences of the DNA samples as fast as a built-in camera can record them.

**Superfast Sequencing:** Prototype machines at Pacific Biosciences are being tested with bacterial DNA.

A computer monitor installed next to each machine displays a snapshot of the action taking place. A series of lights scatter across the screen, bursting and fading in quick succession. Each flash lasts just tens of milliseconds, but its color indicates which of the four bases has just been added to a strand of DNA, and its position indicates where. The video must be slowed for viewing: the flashes come too fast for the human eye to process. Computer algorithms convert the pattern of flashes into DNA sequences hundreds to thousands of bases long. Additional algorithms then compare millions of these stretches of DNA, identify sequences that overlap at their ends, and fit the pieces together to capture a complete genome.
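The final step described above, finding reads that overlap at their ends and merging them, can be illustrated with a toy sketch. This is not Pacific Biosciences' software, which the article does not describe; it is a simple greedy overlap merge, and the function names, the `min_len` threshold, and the sample reads are all assumptions made for illustration. Real assemblers must also cope with sequencing errors, repeated regions, and reads from either DNA strand, none of which is handled here.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest end overlap.

    Hypothetical toy assembler: assumes error-free reads from one strand.
    Returns the list of contigs left when no pair overlaps any more.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join
        # Append the non-overlapping tail of the second read to the first.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three made-up reads that chain together end to end.
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCGGA"]))
```

The greedy choice keeps the sketch short, but it can pick the wrong merge when a genome contains repeats, which is one reason production tools use graph-based methods instead.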

When it comes to sequencing DNA, time is money, and Pacific Biosciences’ commercial machines, due out in 2010, could prove to be the fastest ever made. It took the Human Genome Project roughly $300 million and 13 years to work out the sequence of the three billion DNA base pairs in a composite human genome, a task completed in 2003. By October 2008, researchers using a variety of new types of machines were saying that they could sequence an individual genome for less than $100,000; one company promises a $5,000 genome by next spring. And Pacific Biosciences predicts that by 2013, its machines will be able to sequence a person’s genome in 15 minutes, for less than $1,000.

Up to now, scientists have sequenced the genomes of a handful of people, and that’s given them a general sense of human variability. But fast, cheap sequencing technology could make it practical to read the genomes of thousands, perhaps millions, of people. By combing through those myriad genomes and linking specific DNA sequences to different characteristics–handedness, height, blood pressure, and susceptibility to anxiety, to name a few–scientists should be able to unravel the complex interplay of genetic variants that makes each individual unique. Most important, that kind of sequencing capacity might finally reveal the inherited basis of common diseases–a riddle that has been taunting geneticists for decades.

The actual impact on medicine, however, is far less certain and may be much less positive. For almost two decades, researchers have promised that advances in sequencing technology will enable doctors to practice personalized medicine, targeting treatments to patients on the basis of their genetic profiles. The assumption was that a limited number of common genetic variants would turn out to underlie a particular disease, and physicians would be able to prescribe drugs according to which variants their patients carried. But the latest data suggest that even the most common heritable illnesses, such as diabetes and heart disease, are linked to many different variants, each of them relatively rare. If that’s true, then practicing personalized medicine could become very complicated–and very expensive. “It would not be good to have a $5,000 genome and a $500,000 analysis,” says Francis Collins, the former director of the National Human Genome Research Institute and a leader of the Human Genome Project.

Beyond Common Variations
Genomic medicine began in earnest in the 1980s, when scientists identified genes linked to diseases such as Duchenne muscular dystrophy and cystic fibrosis. Both are so-called Mendelian diseases, meaning that they’re caused by mutations in a single gene; anyone who inherits either one or two copies of the mutated gene, depending on the disease, will be afflicted. Over the last 20 years, researchers have identified genes for a number of Mendelian disorders, and screening tests based on these discoveries have led to earlier diagnoses. In the case of disorders that develop only when a person inherits two copies of the mutation, the tests can identify healthy carriers, helping them make better-informed decisions about having children. Single-gene disorders, however, make up a very small percentage of human diseases. For most diseases, it’s much harder to pinpoint the genetic culprits.

As scientists began assembling a rough draft of the genome sequence in the late 1990s, they uncovered a useful phenomenon. Large blocks of DNA, known as haplotype blocks, tended to be passed down intact through generations. Different versions of these blocks, which were linked to an individual’s ancestral origins, had characteristic patterns of common genetic variations known as single-nucleotide polymorphisms (SNPs), in which the genetic sequence varies by just one DNA letter. Thus, a telltale SNP could serve as a marker for its surrounding DNA. The discovery was a boon to geneticists–if each block tended to occur in a limited number of varieties within the human population, it would be unnecessary to check every base in the genome for variations linked to common diseases such as asthma or schizophrenia. The presence of a particular SNP would indicate which haplotype block an individual carried.

Researchers developed genetic microarrays that could quickly detect the presence of these common SNPs throughout the genome; by scanning for the telltale variations, a relatively inexpensive process, the microarrays have enabled the largest genomic studies to date. Scientists have used them to efficiently search tens of thousands of human genomes for SNPs more common in people with autism or Alzheimer’s, for example, than in healthy people. Over the last two years, a flood of studies have been published, identifying more than 300 genetic variations linked to an assortment of common traits and diseases.

But finding these variations has not led to the breakthrough that some scientists had hoped for in understanding the genetic basis of common diseases. That’s because they turn out to account for only a small fraction of the genetic risk for many illnesses. Researchers have identified 18 genes linked to type 2 diabetes, for example, and tests to identify the variations have been introduced. Yet many other heritable risk factors for the disease remain unidentified. That means that the new tests give an incomplete picture of how likely someone is to develop diabetes, making it difficult to use them to tailor medical decisions. “There is very little reason to be encouraged that prevention strategies can be revolutionized with what we’ve discovered so far [on the genetic basis of common diseases],” says David Goldstein, director of the Center for Population Genomics and Pharmacogenetics at Duke University in Durham, NC.

The hunt for SNPs makes sense if the inherited risk for diseases like type 2 diabetes results from a combination of many common genetic variations, each exerting a small effect. But what if that is only part of the story? What if other, rarer types of genetic mutations are also playing a role? Because microarrays were designed to detect common SNPs, they miss variations that appear in less than 1 percent of the population. These mutations are the focus of an alternative hypothesis, in which–as in the Mendelian model–high-impact individual variations contribute heavily to a disease. Any one of the variations may occur infrequently, according to this thinking, but if they affect the same or related biochemical pathways, they may produce similar outcomes. Collectively, they could make a disorder relatively common.
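The arithmetic behind that last point can be sketched in a few lines. The numbers below are invented for illustration and do not come from any study cited here, and the calculation assumes the variants are inherited independently of one another, which real variants in the same gene often are not.

```python
def fraction_with_any_variant(frequencies):
    """Fraction of people carrying at least one of several variants.

    `frequencies` holds the share of the population carrying each variant.
    Assumes independence between variants (an illustrative simplification).
    """
    fraction_with_none = 1.0
    for p in frequencies:
        fraction_with_none *= (1.0 - p)
    return 1.0 - fraction_with_none


if __name__ == "__main__":
    # Hypothetical case: 50 different variants, each carried by 0.1 percent
    # of people, far too rare for a microarray built around common SNPs.
    rare = [0.001] * 50
    print(f"{fraction_with_any_variant(rare):.1%} carry at least one")
```

Under these assumed figures, a little under 5 percent of people would carry at least one of the variants, even though each variant on its own falls well below the 1 percent threshold.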

Until recently, only limited efforts had been made to search for rare variants linked to common diseases. This search may involve sifting through every letter of DNA–something that can only be done by sequencing. With the old technology, that was too expensive to be practical. But in view of the disappointing results from microarray studies, scientists are turning to the fast new sequencing technologies to rigorously test the rare-variant hypothesis. It’s likely that “much of the rest of the heritability [of disease] is hiding in rare variants with high impact,” Collins says. “If we really want to understand the genomics of disease, we need complete genome sequences.”

It’s still unclear how much rare variations contribute to disease, but evidence is starting to trickle in. In a study published this summer, biologists at the University of California, Berkeley, sequenced the gene for an enzyme called MTHFR, which converts the B vitamin folate (folic acid) from one form into another. Scientists had previously identified a common genetic variant that produces a weakened version of the enzyme, increasing the risk of birth defects and possibly of heart disease. By sequencing the MTHFR gene in 564 people of different ethnicities, Nick Marini and colleagues found four new variants that also impair the enzyme’s function; present in fewer than 1 percent of the subjects, these variants would have been undetectable in microarray studies.

The Personal Genome
At a recent conference at the venerable Cold Spring Harbor Laboratory on Long Island, James Watson, codiscoverer of the structure of DNA, sat slouched in the front row of the auditorium beneath a large portrait of himself. Watson, who for a time headed the Human Genome Project, had his genome sequenced in 2007. His was only the second individual genome to be completely mapped. (Craig Venter, who led the private effort to sequence the genome, used his own DNA as the sample.)

Watson isn’t known for sitting through successive conference presentations. But a good portion of this conference was about him. He attended talk after talk, as scientists presented their analyses of what has become affectionately known as “Project Jim.” Watson is a seemingly healthy 80-year-old man, and the results of scrutinizing his genome have so far been fairly mundane. He has extra copies of genetic variations shown in previous studies to protect against heart disease and macular degeneration, for example. An initially worrying mutation in the BRCA1 gene, which is linked to breast cancer, turned out to be harmless. But the vast majority of Watson’s genome remains uninterpretable. Scientists have yet to find a genetic component to his intelligence or his curiosity or his tendency toward politically incorrect outbursts. Perhaps most important to Watson, it’s not yet clear whether he harbors a genetic vulnerability to schizophrenia that he passed along to his son, who has the disease.

The Human Genome Project’s reference sequence, which is a composite of genetic information from more than 20 individuals, gave scientists a basic blueprint of the genome. But a single genome has its limits. It’s only by comparing multiple genomes that scientists can begin to get a handle on the genetic variability that underlies the vulnerability to disease or madness, the tendency to athletic prowess or mathematical genius, the drive toward altruism or aggression.

Even Watson, who has spent his career trying to understand DNA, seems less than impressed to see the details of his genome presented. “We’ll see if any of it adds five minutes to my life span,” he remarked at the conference. Indeed, the meaning of most of his genetic quirks will remain a mystery until many more people join him in having their genomes sequenced.

**The PGP 10:** The first 10 volunteers in the Personal Genome Project are currently having the coding regions of their genomes sequenced; a small piece of sequence is shown for those whose data is posted online. The sequence data will be stored in a public database, along with the volunteers’ medical records and other information, such as their facial morphology (as measured by the forehead tapes). Scientists will use the database, which is expected eventually to include 100,000 people, to search for links between genes and diseases or other characteristics.

Harvard Medical School geneticist George Church, who has been working on sequencing technology since his PhD research at Harvard in the early 1980s, aims to speed that process along. Three years ago Church launched the Personal Genome Project (PGP), which aims to collect genetic and medical data from thousands of people over the next five years. The project indicates not just the technical and scientific challenges that might be posed by large-scale sequencing of human genomes, but the ethical issues as well.

In the pilot phase, the project will focus on 10 volunteers, including Church, Harvard psychologist Steven Pinker, and entrepreneur Esther Dyson. To start, it will sequence the coding regions of their genomes–the 1 percent of DNA that directs the production of proteins. That information, along with the participants’ medical histories (including prescription regimens) and information about their height, weight, handedness, and other traits, will be deposited in a public database. Church’s team hopes that this database will serve as a resource for scientists, or even members of the public, who want to search for links between specific genetic variations and diseases or other traits.

The first set of data–released to participants in October–hints at both the promise of sequencing and the current limitations of genetic analysis. John Halamka, CIO of Harvard Medical School and another one of the 10 original volunteers, learned that he carries a mutation for Charcot-Marie-Tooth disease, an inherited neurological disorder. This rare variation would not have been found with existing SNP arrays. But since Halamka survived childhood unscathed, and only three other people in the world have been shown to carry that particular mutation, it’s hard to know what impact, if any, it has had on his health. Perhaps many people carry the variation with no ill effect, and the link between the disease and the mutation has been overstated. Or perhaps the gene has a broader impact than expected, raising the risk of other neurological diseases. (Or, as George Church notes, the finding may simply be an error.)

The greater the number of entries in the database, the easier it will be to understand a finding like Halamka’s. And in April 2008, Church’s team received approval from Harvard to expand the project from 10 to 100,000 participants. (Church plans to scale up slowly, multiplying the number of subjects by 10 each year.) This next phase will seriously test both the technology used to sequence the genomes and the strategies used to interpret the resulting data. As of November, about a year into the project, PGP scientists had gotten only about a fifth of the way through sequencing the coding regions of the original volunteers’ genomes. (Church plans to expand the PGP to the entire genome once sequencing becomes cheap enough.) If they’re to sequence thousands more genomes, sequencing technology will need to become as fast and robust as Church believes it can be.

Too Much Information
Making use of the data from the PGP will pose problems of its own. First, Church and his team will need to figure out the best way to give the larger group of volunteers their results. The first 10 received one-on-one genetic counseling from Joseph Thakuria, the project’s medical director and a clinical geneticist at Harvard Medical School. But Thakuria won’t be able to counsel the thousands of new subjects. Given the shortage of geneticists and genetic counselors with appropriate training, that problem is almost certain to be echoed much more broadly as personal genomics becomes more accessible.

But the greatest challenge in the next phase of human genomics is likely to be interpreting the meaning of the seemingly endless array of variations that will be uncovered. Individual genetic changes occur by chance, and some are harmless. Others happen to be dangerous, disrupting some vital cellular process and raising the risk of disease. And some may even be beneficial–enhancing the breakdown of toxins, for example, and thus protecting against certain ailments. But it’s often impossible to tell which class a variation falls into just by looking at it. And as new technologies allow scientists to sequence the genomes of large numbers of people, the list of known variants will quickly grow. “This information is going to be thorny and problematic in terms of interpretation,” says James Evans, a professor of genetics and medicine at the University of North Carolina at Chapel Hill. “We all have mutations and alterations that we simply don’t understand. As usual, the technology will be ahead of our ability to use it.”

The complexity of the new genomic information may also be an obstacle to the personalized medicine that gene sequencing was supposed to usher in. Researchers have hoped to create tests that predict an individual’s risk for a specific disease or reveal which drug is likely to work best for him or her. But genetic tests that detect newly discovered variations won’t be very useful until scientists can figure out what those variations mean. And if many common diseases are caused by rare variants, the task will be enormous. “Understanding risk based on rare variants is going to take us years,” says Dietrich Stephan, founder and chief science officer of Navigenics, a personal-genomics startup.

Some scientists think that the real value of genomics may not lie in personalized medicine at all. Where it will really pay off, they say, will be in deepening our understanding of disease and helping researchers discover new targets for drugs. “The primary value of genetic mapping is not risk prediction, but providing novel insights about mechanisms of disease,” wrote David Altshuler, a physician and geneticist at the Broad Institute in Cambridge, MA, in a recent article published in the journal Science. In fact, Altshuler points out, identifying even rare genetic changes can end up helping a large number of patients. For example, studies of an inherited form of high cholesterol found in less than 0.2 percent of the population led to the discovery of the low-density lipoprotein (LDL) receptor, which helps to remove excess cholesterol from the bloodstream. That in turn led to the development of the blockbuster drugs known as statins, cholesterol-lowering medications that trigger an increase in the number of LDL receptors on the surfaces of liver cells.

No one knows when the next blockbuster will arrive. Making predictions about the benefits of genomics has become as thankless as trying to predict disease risk itself. And the easier it gets to sequence a genome, the harder it becomes to make sense of the complexity the sequences reveal. As Collins puts it, “The Human Genome Project was perhaps a simple undertaking compared to what we face next.”

Emily Singer is Technology Review’s senior biomedical editor.
