First You Have to Find Them
To get a sense of how big a mountain Hunter and his colleagues are tunneling into, consider the fact that every human cell has 23 pairs of chromosomes containing about 3.5 billion pairs of nucleotides, the chemical “letters” A, C, G and T that make up DNA’s genetic code. But the actual genes that carry code to make proteins, and go wrong in genetic diseases and cancer, occupy less than 3 percent of the genome; the rest is genetic noise. Making genes still trickier to unearth is the fact that their protein-coding elements are scattered, as are the genetic signals that the cell uses to stitch them back together and guide their “expression”: the process that activates them to make proteins. “The key to understanding the genome is understanding the language of these signals,” says David Haussler, a leading computational biologist at the University of California at Santa Cruz. “But they are hidden, and they are noisy.”
The first crucial problem is to extract the genes from this maze of irrelevant code. At Oak Ridge National Laboratory, Edward Uberbacher’s Computational Biosciences Section has tackled the gene-finding problem with artificial neural networks, a type of artificial intelligence (AI) program distinguished by its capacity to learn from experience. At Oak Ridge, neural nets had been used for jobs such as recognizing enemy tanks in fuzzy satellite images; in 1991, Uberbacher adapted these methods to create a program, called GRAIL, that can pick out genes. Since then, GRAIL has been joined by at least a dozen other gene-finding programs, many of which are available to researchers online.
The current gene-locating programs are far from perfect, sometimes predicting genes that aren’t real and often missing genes that are. Partly because of accuracy problems, says Uberbacher, “these methods have been on the fringe for a while.” But given the accelerating flood of genome data, biologists will be forced to rely on them, and to improve them. “Imperfect as they are, they are the best place to start,” says Lisa Brooks, program director of the National Human Genome Research Institute’s genome informatics branch, whose operation doles out $20 million a year to support bioinformatics databases and to develop new data-mining methods.
Pattern-recognition programs aren’t used only for discovering genes; they’re also heavily exploited to give researchers clues as to what genes do. Today the most widely used program (the NCBI’s Basic Local Alignment Search Tool, or BLAST) receives 50,000 hits per day from researchers searching for similarities between newly discovered DNA sequences and ones whose roles are already understood. Given similar sequences, scientists can often deduce that two genes have similar functions.
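For readers who want a feel for what “searching for similarities” means in practice, here is a minimal sketch of local sequence alignment using the classic Smith-Waterman dynamic-programming method. It is an illustration only, not BLAST’s own code: BLAST relies on a faster heuristic that finds short matching “seeds” and extends them. The function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_by_similarity(query, known):
    """Sort (name, score) pairs for a dict of known sequences, best first."""
    scores = [(name, local_alignment_score(query, seq)) for name, seq in known.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_by_similarity` with a new sequence and a small dictionary of sequences whose roles are already understood returns the closest matches first, which is the kind of clue researchers use to guess at a gene’s function. A real search also has to judge whether a score is statistically meaningful, a step this sketch leaves out.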
In researchspeak, the process of interpreting a gene’s function and entering it into a database is called “annotation.” In May, London’s Sanger Center and the European Bioinformatics Institute (EBI), a branch of the multinational European Molecular Biology Laboratory in Hinxton, England, announced a hastily organized project known as EnsEMBL. The goal of EnsEMBL, says EBI’s Alan Robinson, is “to make sure the first draft of the human genome will have annotation attached.” EnsEMBL’s first activity will be to send out gene-finding algorithms to rove the genome and bring back a rough picture of where the genes are: a prospector’s hand-drawn map. With the map drawn, EnsEMBL will use tools such as BLAST to guess at the genes’ functions.
Plans for computerized discovery pipelines like this one are important to pharmaceutical companies, which are racing to identify (and patent) key disease-causing genes. In June, for example, the German drug giant Bayer agreed to pay a Heidelberg startup, Lion Bioscience, as much as $100 million for an automated system to mine genetic databases. Lion has dubbed the computerized approach “i-biology,” according to its head of bioinformatics, Reinhard Schneider, and is promising Bayer that in five years its computers will discover 500 new genes, as well as annotate 70 genes Bayer has already found. Pattern-recognition algorithms, which will drive the daily scourings of the databases, lie at the core of i-biology.
Although the Bayer-Lion pact is a record-breaker, it is just one among dozens of data-mining alliances between pharmaceutical giants and computationally savvy startups, evidence that mathematical methods are taking center stage in genomic research. And the academics who write the algorithms also find their stars rising, especially in industry. Lion was founded by top bio-infonauts from the European Molecular Biology Laboratory, headquartered in Heidelberg. At Celera Genomics, the Rockville, Md., company whose plans to decipher the genetic code have shaken up the Human Genome Project and accelerated the publicly funded work, success rides on the expertise of pattern-analysis specialist Eugene Myers. Celera lured Myers from a tenured position at the University of Arizona to head its informatics efforts, hiring Compaq to build him what’s being touted as the world’s most powerful civilian supercomputer (see “The Gene Factory,” TR March/April 1999). According to Haussler, most scientists think the success of Myers’ methods will “make or break” Celera.