Mining the Genome

The Human Genome Project piles up Everests of data. But getting new drugs out of it will require sophisticated software for sniffing out patterns, one of the most crucial tasks of the hot field known as bioinformatics.

Antonio Regalado

September 1, 1999

Larry Hunter had just moved into his new office when a reporter visited, so the room lacked knickknacks and family snapshots. Hunter had, however, started unpacking his books, and they were already beginning to form an interesting pattern. Roger Schank’s Dynamic Memory, a classic title in artificial intelligence, was shelved next to Georg Schulz’s Principles of Protein Structure. Machine Learning flanked Oncogenes. Artificial Life leaned on Medical Informatics.

Properly interpreted, the pattern on Hunter’s bookshelf reveals the latest trend in biology, a field now so overwhelmed by information that it is increasingly dependent on computer scientists like Hunter to make sense of its findings. An expert in an offshoot of artificial intelligence research known as machine learning, in which computers are taught to recognize subtle patterns, Hunter was recently lured from a solitary theoretical post at the National Library of Medicine to head the molecular statistics and bioinformatics section at the National Cancer Institute (NCI), a group formed in 1997 to use mathematical know-how to sift the slurry of biological findings.

Where is all the data coming from? The simple answer is that it’s washing out of the Human Genome Project. Driven by surprise competition from the commercial sector, the publicly funded effort to catalog the estimated 100,000 human genes is nearing its endgame; several large academic centers aim to finish a rough draft by next spring. By then, they will have dumped tens of billions of bits of data into the online gene sequence repository known as GenBank, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) in Bethesda, Md. And DNA sequences aren’t the only type of data on the rise. Using “DNA chips,” scientists can now detect patterns as thousands of genes are being turned on and off in a living cell, adding to the flood of findings.

“New kinds of data are becoming available at a mind-blowing pace,” exults Nat Goodman, director of life sciences informatics at Compaq Computer. Compaq is one of many companies seeking an important commercial opportunity in “bioinformatics.” This congress of computers and biology is a booming business, but has so far revolved mostly around software for generating and managing the mountain of gene data. Now, pharmaceutical companies need ever-faster ways to mine that mountain for the discoveries that will lead to new treatments for disease.

That’s where entrepreneurial researchers such as Larry Hunter come in. On Hunter’s bookshelf sits a glass bauble reading: “$2,000,000 Series A Preferred. March 5, 1999,” a celebration of venture capital funds raised by Molecular Mining, a company he co-founded. The firm, based in Kingston, Ontario, hopes to use data-mining methods to help pharmaceutical companies speed the development of new drugs by identifying key biological patterns in living cells, such as which genes are turned on in particularly dangerous tumors and which drugs those tumors will respond to. And a dozen other startups, the biotech industry’s best indicator of a hot trend, have been formed to make data-mining tools (see “The Genome Miners”). “Biology,” Hunter predicts, “will increasingly be underpinned by algorithms that can find hidden structure in massive amounts of molecular data.” This kind of data-mining work, which Hunter specializes in, is often known as “pattern recognition,” and it’s one of the fastest-moving areas in bioinformatics. Indeed, if Hunter is right, pattern recognition might turn out to be the pick that brings forth the gold of new therapies.

The Genome Miners

A sampling of companies specializing in pattern-recognition software.

Company | Status | Location | Highlight
Bioreason | private | Santa Fe, N.M. | Artificial intelligence software makes sense of chemistry data.
Compugen | private | Tel Aviv, Israel | Ex-Israeli defense contractors are scoring big in genetic data-mining. Customers include U.S. Patent Office.
IBM | public | Armonk, N.Y. | Advanced pattern-recognition algorithms power a 1997 Monsanto alliance for protein discovery.
Lion Bioscience | private | Heidelberg, Germany | $100 million pact with drug giant Bayer sets a bioinformatics record.
Molecular Mining | private | Kingston, Ontario | Raised $2 million in startup funds from venture capitalists in March.
Neomorphic | private | Berkeley, Calif. | Hidden Markov models are among this 1996 startup’s advanced gene-finding tools.
Partek | private | St. Peters, Mo. | Neural networks specialists moved into biology market in 1998.
Silicon Genetics | private | San Carlos, Calif. | Stanford spinoff mines gene data for profit.
Silicon Graphics | public | Mountain View, Calif. | MineSet visual data-mining tool is popular in the financial, telecom and drug industries.

First You Have to Find Them

To get a sense of how big a mountain Hunter and his colleagues are tunneling into, consider the fact that every human cell has 23 pairs of chromosomes containing about 3.5 billion pairs of nucleotides, the chemical “letters” A, C, G and T that make up DNA’s genetic code. But the actual genes that carry code to make proteins, and go wrong in genetic diseases and cancer, occupy less than 3 percent of the genome; the rest is genetic noise. Making genes still trickier to unearth is the fact that their protein-coding elements are scattered, as are the genetic signals that the cell uses to stitch them back together and guide their “expression”: the process that activates them to make proteins. “The key to understanding the genome is understanding the language of these signals,” says David Haussler, a leading computational biologist at the University of California at Santa Cruz. “But they are hidden, and they are noisy.”

The first crucial problem is to extract the genes from this maze of irrelevant code. At Oak Ridge National Laboratory, Edward Uberbacher’s Computational Biosciences Section has tackled the gene-finding problem with artificial neural networks, a type of artificial intelligence (AI) program distinguished by its capacity to learn from experience. At Oak Ridge, neural nets had been used for jobs such as recognizing enemy tanks in fuzzy satellite images; in 1991, Uberbacher adapted these methods to create a program, called GRAIL, that can pick out genes. Since then, GRAIL has been joined by at least a dozen other gene-finding programs, many of which are available to researchers online.

The current gene-locating programs are far from perfect, sometimes predicting genes that aren’t real and often missing genes that are. Partly because of accuracy problems, says Uberbacher, “these methods have been on the fringe for a while.” But given the accelerating flood of genome data, biologists will be forced to rely on them, and to improve them. “Imperfect as they are, they are the best place to start,” says Lisa Brooks, program director of the National Human Genome Research Institute’s genome informatics branch, whose operation doles out $20 million a year to support bioinformatics databases and to develop new data-mining methods.

Pattern-recognition programs aren’t used only for discovering genes; they’re also heavily exploited to give researchers clues as to what genes do. Today the most widely used program, the NCBI’s Basic Local Alignment Search Tool, or BLAST, receives 50,000 hits per day from researchers searching for similarities between newly discovered DNA sequences and ones whose roles are already understood. Given similar sequences, scientists can often deduce that two genes have similar functions.
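To make the idea of a similarity search concrete, here is a minimal sketch in Python. It is not BLAST, which relies on fast heuristics; it uses the slower, exact Smith-Waterman local alignment score as a stand-in. The sequences, the gene names and the scoring values are all invented for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b.

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_hit(query, database):
    """Name of the database sequence that scores highest against the query."""
    return max(database, key=lambda name: local_alignment_score(query, database[name]))


# Hypothetical "already understood" sequences and a newly found fragment.
known_genes = {
    "gene_A": "ATGGCGTACGTTAGC",
    "gene_B": "TTTTCCCCGGGGAAAA",
}
query = "GGCGTACGTT"

print(best_hit(query, known_genes))
```

In this toy case the query is an exact fragment of gene_A, so gene_A comes back as the best hit. A real search would also have to judge whether a score is statistically meaningful, which this sketch ignores.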

In researchspeak, the process of interpreting a gene’s function and entering it into a database is called “annotation.” In May, London’s Sanger Center and the European Bioinformatics Institute (EBI), a branch of the multinational European Molecular Biology Laboratory in Hinxton, England, announced a hastily organized project known as EnsEMBL. The goal of EnsEMBL, says EBI’s Alan Robinson, is “to make sure the first draft of the human genome will have annotation attached.” EnsEMBL’s first activity will be to send out gene-finding algorithms to rove the genome and bring back a rough picture of where the genes are, a prospector’s hand-drawn map. With the map drawn, EnsEMBL will use tools such as BLAST to guess at the genes’ functions.

Plans for computerized discovery pipelines like this one are important to pharmaceutical companies, which are racing to identify, and patent, key disease-causing genes. In June, for example, the German drug giant Bayer agreed to pay a Heidelberg startup, Lion Bioscience, as much as $100 million for an automated system to mine genetic databases. Lion has dubbed the computerized approach “i-biology,” according to its head of bioinformatics Reinhard Schneider, and is promising Bayer that in five years its computers will discover 500 new genes, as well as annotate 70 genes Bayer has already found. Pattern-recognition algorithms, which will drive the daily scourings of the databases, lie at the core of i-biology.

Although the Bayer-Lion pact is a record-breaker, it is just one among dozens of data-mining alliances between pharmaceutical giants and computationally savvy startups, evidence that mathematical methods are taking center stage in genomic research. And the academics who write the algorithms also find their stars rising, especially in industry. Lion was founded by top bio-infonauts from the European Molecular Biology Laboratory, headquartered in Heidelberg. At Celera Genomics, the Rockville, Md., company whose plans to decipher the genetic code have shaken up the Human Genome Project and accelerated the publicly funded work, success rides on the expertise of pattern analysis expert Eugene Myers. Celera lured Myers from a tenured position at the University of Arizona to head its informatics efforts, hiring Compaq to build him what’s being touted as the world’s most powerful civilian supercomputer (see “The Gene Factory,” TR March/April 1999). According to Haussler, most scientists think the success of Myers’ methods will “make or break” Celera.

Cancer Categorizer

Crucial as they are, identifying and comparing genes for clues to their function are just first steps on a long path toward medical relevance-developing a drug can take many years longer. But computational scientists say pattern mining could have much nearer-term payoffs when applied to another type of genomic data known as “gene expression profiles.”

A gene’s expression level refers to how many copies of its specific protein it is being called upon to make at any given time. The proteins are the actual workhorses in the cell, carrying out the daily tasks of metabolism; the levels of each can vary dramatically over time, and are often out of kilter in diseased cells. Thanks to devices known as DNA microarrays, or, more familiarly, “DNA chips,” scientists can now for the first time regularly measure the expression levels of thousands of genes at once. DNA chips take advantage of the fact that to make a protein, a cell first “transcribes” a gene into multiple copies of a molecule called messenger RNA (mRNA). The type and quantity of mRNAs in a cell correspond to the proteins on order, and by measuring the levels of thousands of different mRNAs at once, DNA chips are able to create a snapshot of the activity of thousands of genes.

Mark Boguski, senior investigator at NCBI, says the new data on gene expression levels are “unlike anything biologists have ever been exposed to.” Before, biologists could only analyze the activity of a few genes at a time. Now, DNA chips can produce a “massively parallel” readout of cellular activity. That’s an important advance, because the difference between health and disease usually lies not in the activity of a single gene but in the overall pattern of gene expression.

A team at the Whitehead/MIT Center for Genome Research is putting this massively parallel readout to work identifying telltale differences between different cancers. Known as the Molecular Pattern Recognition group, it was started last year by genome center director Eric Lander and is led by molecular biologist Todd Golub. Other members include ex-IBM mathematician Jill Mesirov, computer scientist Donna Slonim, and computational physicist Pablo Tamayo, who joined Whitehead from the supercomputer company Thinking Machines.

This interdisciplinary brain trust is trying to solve an enormously important problem in pattern recognition. Tumors vary in subtle ways, and cancer cells that look the same under a microscope respond very differently to drugs. “The things we call a single type of cancer are surely many types of cancer,” says Lander, “but we don’t know what [differences] to look for.”

To provide a benchmark for the new methods, Lander’s group started with two types of leukemia that can already be distinguished under the microscope: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). They measured the expression levels of about 6,800 different genes in bone marrow samples from 38 leukemia patients, which they would mine for patterns that could distinguish AML from ALL. But working with 6,800 parameters (the genes) and only 38 data points (the samples) made for a task akin to trying to forecast an election by polling a dozen people. After running through a year’s supply of pencils and scratch paper, they hit on a solution.

A key step involved feeding the data points into a learning algorithm known as a “self-organizing map.” By plotting the 38 samples into a high-dimensional mathematical space, the map algorithm was able to partition the samples into two groups, one for each type of cancer. Checking against information about the known tumor types, Lander says, it became clear that the clusters broke out the ALL and AML samples almost perfectly. “We showed that if you hadn’t known the distinction between these two types of leukemias, which in fact took 40 years of work to establish, you would have been able to recapitulate that in one afternoon,” he says.
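For readers curious what a self-organizing map does mechanically, the following is a minimal sketch in Python, not the Whitehead group’s method. It assumes a map with just two nodes, a made-up data set of 12 samples with 6 “genes” each, and learning-rate and neighborhood settings chosen for illustration. Each sample pulls its closest node (and, early on, that node’s neighbor) toward itself; afterward, samples are grouped by the node they land nearest.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weights are closest (squared Euclidean) to x."""
    return min(range(len(weights)),
               key=lambda k: sum((wi - xi) ** 2 for wi, xi in zip(weights[k], x)))


def train_som(samples, n_nodes=2, epochs=50, seed=0):
    """Train a one-dimensional self-organizing map; returns the node weights."""
    rng = random.Random(seed)
    # Start each node at a distinct, randomly chosen sample.
    weights = [list(s) for s in rng.sample(samples, n_nodes)]
    order = list(range(len(samples)))
    for epoch in range(epochs):
        frac = epoch / (epochs - 1)
        lr = 0.5 * (1 - frac) + 0.01 * frac     # learning rate decays over time
        sigma = 1.0 * (1 - frac) + 0.1 * frac   # neighborhood width shrinks
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            bmu = best_matching_unit(weights, x)
            for k, w in enumerate(weights):
                # Nodes near the winner on the map are pulled along with it.
                influence = math.exp(-((k - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * influence * (x[d] - w[d])
    return weights


def toy_profiles(seed=1, per_group=6):
    """Invented expression profiles: two groups with opposite high/low genes."""
    rng = random.Random(seed)

    def profile(first, second):
        return ([first + rng.uniform(-0.3, 0.3) for _ in range(3)]
                + [second + rng.uniform(-0.3, 0.3) for _ in range(3)])

    group_a = [profile(5, 1) for _ in range(per_group)]
    group_b = [profile(1, 5) for _ in range(per_group)]
    return group_a + group_b


samples = toy_profiles()
weights = train_som(samples)
clusters = [best_matching_unit(weights, s) for s in samples]
print(clusters)
```

The algorithm is never told which group a sample belongs to; the split emerges from the data. The toy groups here are far better separated than real tumor samples would be, which is why the partition comes out clean.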

The research team also got an inkling of how valuable their methods (still unpublished as TR went to press) could be for patients. At one point, the algorithms failed to categorize a sample into either of the leukemia categories. Was the math flawed? No; the diagnosis was. Prompted by the program’s result, doctors took another look and found what they had believed was leukemia was in fact a highly malignant muscle cancer, for which the patient is now being treated. At Cambridge, Mass.-based Millennium Pharmaceuticals, researchers are betting similar approaches will lead to “optimal diagnostic tests” for cancer, according to Dave Ficenec, a former astrophysicist hired by Millennium to install the latest data-mining algorithms in its in-house software. The company collaborates closely with Lander’s center; Lander is a Millennium co-founder who sits on the company’s board of directors.

The new parallel methods for making snapshots of gene expression are also being used to evaluate new drug candidates. At startup Rosetta Inpharmatics in Kirkland, Wash., a scientific team is assembling and mining databases for gene patterns to speed drug discovery. Rosetta studies yeast cells, exposing them to potential new drugs and then analyzing levels of gene expression for clues to the drugs’ actions. For example, the cells can be rapidly checked to see whether their response matches a pattern typical of toxic side effects. Tossing out such losers early on is part of Rosetta’s program of “improving the efficiency of drug discovery,” says Stephen Friend, who doubles as Rosetta’s chief science officer and head of the molecular pharmacology program at Seattle’s Fred Hutchinson Cancer Research Center. Drug firms have taken notice, with eight signed up as Rosetta partners.
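One simple way to check whether a response “matches a pattern” is to correlate the two expression profiles. The sketch below, in Python, does that with Pearson correlation. It is an assumption for illustration, not a description of Rosetta’s actual method; the signature, the compound profiles and the 0.8 cutoff are all invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def flag_toxic(profile, signature, threshold=0.8):
    """True if the profile tracks the reference signature closely (assumed cutoff)."""
    return pearson(profile, signature) >= threshold


# Hypothetical expression changes for five genes (positive = turned up).
toxic_signature = [2.0, 1.5, -1.0, -2.0, 0.5]
compound_x = [1.8, 1.6, -0.9, -2.2, 0.4]    # closely tracks the signature
compound_y = [-0.2, 0.1, 1.5, 0.3, -1.0]    # moves in a different direction

for name, profile in [("compound_x", compound_x), ("compound_y", compound_y)]:
    print(name, "flagged" if flag_toxic(profile, toxic_signature) else "passed")
```

Here compound_x would be set aside early and compound_y would move on to further testing. With thousands of genes per profile rather than five, the same comparison still runs in a fraction of a second, which is what makes this kind of screening fast.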

Brain Drain

While researchers at companies and universities are jumping on the data-mining bandwagon, they are likely to encounter plenty of bumps in the road ahead. Some investors, for instance, remain concerned that databases of different biological results are still poorly interconnected, and sometimes of uneven quality. Says Larry Bock, an investor at the Palo Alto office of the venture firm CW Group: “It may be a bit early for data-mining, since your ability to mine is directly related to the quality of the database.” Still, says Barbara Dalton, vice president at the venture firm SR One in West Conshohocken, Pa., “the long-term prospects look good.” SR One, along with Princeton, N.J.’s Cardinal Health Partners, anted up $2 million to finance Larry Hunter’s startup, Molecular Mining. “Data-mining is going to be a core part” of drug discovery, Dalton predicts.

But before that happens, the field may have to break its most serious bottleneck: an acute shortage of mentors. Bioinformatics has grown explosively during the 1990s, drawing many of the best university teachers and researchers into the high-paying private sector. “We went from very little interest in bioinformatics, to - Bang! - having most of the people working in companies,” says Mark Adams, who left the academic track to work for the Cambridge, Mass., biotech company Variagenics. With universities drained of some of their brightest minds, many wonder who will train the next generation of computational biologists.

Part of the answer came in June, when a special advisory panel convened by NIH director Harold Varmus concluded the U.S. government should spend as much as $10 million to fund 20 new “programs of excellence” in biomedical computing. Several universities have also gotten into the act, including Johns Hopkins, where a new computational biology program is under way, thanks to a $2.5 million grant from the Burroughs Wellcome Fund. Stanford, Princeton and the University of Chicago are all planning major centers that will bring physical scientists together with biologists.

In industry, the convergence is already reality. One-third of Rosetta Inpharmatics’ 100 employees are computational scientists, drawn from fields as diverse as sonar detection, air traffic control and astrophysics. Chief scientist Stephen Friend says he’s come to an important realization since joining the company in 1997. Biologists may still ask the best questions and design the most compelling experiments, he says, but “the best answers are coming from the physicists or mathematicians.” Those answers are likely to lead to important new therapies, gold extracted from the mountains of the Human Genome Project by the tools of pattern recognition.
