Crucial as they are, identifying and comparing genes for clues to their function are just the first steps on a long path toward medical relevance; developing a drug can take many years longer. But computational scientists say pattern mining could have much nearer-term payoffs when applied to another type of genomic data known as “gene expression profiles.”
A gene’s expression level refers to how many copies of its specific protein it is being called upon to make at any given time. The proteins are the actual workhorses in the cell, carrying out the daily tasks of metabolism; the levels of each can vary dramatically over time, and are often out of kilter in diseased cells. Thanks to devices known as DNA microarrays, or, more familiarly, “DNA chips,” scientists can now for the first time regularly measure the expression levels of thousands of genes at once. DNA chips take advantage of the fact that to make a protein, a cell first “transcribes” a gene into multiple copies of a molecule called messenger RNA (mRNA). The type and quantity of mRNAs in a cell correspond to the proteins on order, and by measuring the levels of thousands of different mRNAs at once, DNA chips are able to create a snapshot of the activity of thousands of genes.
Mark Boguski, senior investigator at NCBI, says the new data on gene expression levels are “unlike anything biologists have ever been exposed to.” Before, biologists could only analyze the activity of a few genes at a time. Now, DNA chips can produce a “massively parallel” readout of cellular activity. That’s an important advance, because the difference between health and disease usually lies not in the activity of a single gene but in the overall pattern of gene expression.
A team at the Whitehead/MIT Center for Genome Research is putting this massively parallel readout to work identifying telltale differences between different cancers. Known as the Molecular Pattern Recognition group, it was started last year by genome center director Eric Lander and is led by molecular biologist Todd Golub. Other members include ex-IBM mathematician Jill Mesirov, computer scientist Donna Slonim, and computational physicist Pablo Tamayo, who joined Whitehead from the supercomputer company Thinking Machines.
This interdisciplinary brain trust is trying to solve an enormously important problem in pattern recognition. Tumors vary in subtle ways, and cancer cells that look the same under a microscope respond very differently to drugs. “The things we call a single type of cancer are surely many types of cancer,” says Lander, “but we don’t know what [differences] to look for.”
To provide a benchmark for the new methods, Lander’s group started with two types of leukemia that can already be distinguished under the microscope: acute myeloid leukemia (AML) and acute lymphoid leukemia (ALL). They measured the expression levels of about 6,800 different genes in bone marrow samples from 38 leukemia patients, then set out to mine the data for patterns that could distinguish AML from ALL. But working with 6,800 parameters (the genes) and only 38 data points (the samples) made for a task akin to trying to forecast an election by polling a dozen people. After running through a year’s supply of pencils and scratch paper, they hit on a solution.
A key step involved feeding the data points into a learning algorithm known as a “self-organizing map.” By plotting the 38 samples into a high-dimensional mathematical space, the map algorithm was able to partition the samples into two groups, one for each type of cancer. Checking against information about the known tumor types, Lander says, it became clear that the clusters broke out the ALL and AML samples almost perfectly. “We showed that if you hadn’t known the distinction between these two types of leukemias, which in fact took 40 years of work to establish, you would have been able to recapitulate that in one afternoon,” he says.
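To make the idea concrete, here is a minimal sketch of a two-node self-organizing map in plain Python. It is not the Whitehead group's code, whose details the article does not give. The data are synthetic (38 toy profiles of 50 "genes," 20 of which differ between two made-up classes), and the function names, learning rate, neighborhood radius, and the choice to start the two nodes at a pair of far-apart samples are all assumptions made for illustration.

```python
import math
import random


def make_synthetic_profiles(n_per_class=19, n_genes=50, n_markers=20, seed=0):
    """Return (samples, labels): toy expression profiles for two classes.

    Hypothetical data: the first n_markers genes are shifted up in class 0
    and down in class 1; every gene also gets unit Gaussian noise.
    """
    rng = random.Random(seed)
    samples, labels = [], []
    for label, shift in ((0, 2.0), (1, -2.0)):
        for _ in range(n_per_class):
            profile = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_markers):
                profile[g] += shift
            samples.append(profile)
            labels.append(label)
    return samples, labels


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_node(nodes, x):
    """Index of the map node whose weight vector is closest to x."""
    return min(range(len(nodes)), key=lambda j: sq_dist(nodes[j], x))


def train_two_node_som(samples, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1x2 self-organizing map and return its two weight vectors."""
    rng = random.Random(seed)
    # Assumed initialization: sample 0 and the sample farthest from it.
    far = max(range(len(samples)), key=lambda i: sq_dist(samples[0], samples[i]))
    nodes = [list(samples[0]), list(samples[far])]
    order = list(range(len(samples)))
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # shrinks linearly toward 0
        lr = lr0 * decay                      # learning rate
        radius = max(radius0 * decay, 1e-3)   # neighborhood width
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            bmu = nearest_node(nodes, x)      # best-matching unit
            for j, w in enumerate(nodes):
                # Gaussian neighborhood: the winner moves most, its neighbor less.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return nodes


samples, labels = make_synthetic_profiles()
nodes = train_two_node_som(samples)
clusters = [nearest_node(nodes, s) for s in samples]
```

The class labels are never shown to the training routine; they are kept only so the resulting clusters can be compared with the known classes afterward, much as the researchers checked their clusters against the known tumor types. Real microarray data are far noisier than this toy set and would need normalization and gene filtering first.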
The research team also got an inkling of how valuable their methods (still unpublished as TR went to press) could be for patients. At one point, the algorithms failed to categorize a sample into either of the leukemia categories. Was the math flawed? No. The diagnosis was. Prompted by the program’s result, doctors took another look and found that what they had believed was leukemia was in fact a highly malignant muscle cancer, for which the patient is now being treated.
At Cambridge, Mass.-based Millennium Pharmaceuticals, researchers are betting similar approaches will lead to “optimal diagnostic tests” for cancer, according to Dave Ficenec, a former astrophysicist hired by Millennium to install the latest data-mining algorithms in its in-house software. The company collaborates closely with Lander’s center; Lander is a Millennium co-founder who sits on the company’s board of directors.
The new parallel methods for making snapshots of gene expression are also being used to evaluate new drug candidates. At startup Rosetta Inpharmatics in Kirkland, Wash., a scientific team is assembling and mining databases for gene patterns to speed drug discovery. Rosetta studies yeast cells, exposing them to potential new drugs and then analyzing levels of gene expression for clues to the drugs’ actions. For example, the cells can be rapidly checked to see whether their response matches a pattern typical of toxic side effects. Tossing out such losers early on is part of Rosetta’s program of “improving the efficiency of drug discovery,” says Stephen Friend, who doubles as Rosetta’s chief science officer and head of the molecular pharmacology program at Seattle’s Fred Hutchinson Cancer Research Center. Drug firms have taken notice, with eight signed up as Rosetta partners.
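One simple way to check whether a response "matches a pattern" is to correlate the measured expression changes with a reference signature. The sketch below does that with a Pearson correlation and a cutoff. It is an illustration only, not Rosetta's method, which the article does not describe; the signature values, the two drug profiles, the 0.8 threshold, and the function names are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no pattern to compare
    return cov / math.sqrt(var_a * var_b)


def matches_signature(profile, signature, threshold=0.8):
    """True if the profile correlates with the signature at or above threshold."""
    return pearson(profile, signature) >= threshold


# Hypothetical log-ratio changes for five genes (treated vs. untreated cells).
toxic_signature = [2.0, 1.5, -1.0, 0.0, -2.0]
drug_a = [1.8, 1.6, -0.9, 0.1, -2.2]    # made-up profile resembling the signature
drug_b = [-0.2, 0.1, 1.5, -1.0, 0.3]    # made-up profile that does not

flag_a = matches_signature(drug_a, toxic_signature)
flag_b = matches_signature(drug_b, toxic_signature)
```

In this toy case drug_a would be flagged for a closer look and drug_b would not. A real screen would compare thousands of genes per profile and would need a threshold calibrated against compounds with known effects.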