The Gene Factory

An exclusive peek inside the data machine built to beat the Human Genome Project.

Karin Jegalian

March 1, 1999

On a day of vivid hues last fall, an anxious group of architects, contractors, engineers and scientists gathered in the basement of a building in Rockville, Md. The structure was supposed to be converted by year’s end into the greatest DNA sequencing factory in the world, but the planning meeting confirmed that problems were piling up. Delivery of a crucial steam generator had fallen behind. And it wasn’t even clear that the walls of the 113,000-square-foot office building, which had been occupied by a defense contractor but now stood gutted, would accommodate all the pipes and wires needed to run the new laboratories.

The contractors were uneasy, but if the scientists present in the room weren’t overflowing with sympathy, it was because they had set themselves an even bigger task with an even more dramatic timeline. The researchers work for Celera Genomics Corp., a company formed last May with plans to decode by 2001 all the 3.5 billion chemical letters of DNA that make up human heredity. Celera intends not only to beat by four years the target date originally set by the publicly funded Human Genome Project (which began in 1990), but also to finish the job for a tenth of the government project’s $3 billion price tag.

If these claims were coming from another company, they might be dismissed as outrageous. But Celera is the child of Perkin-Elmer, the instrument company that monopolizes the market for automated DNA sequencing machines, and J. Craig Venter, the most controversial and productive genome researcher in the world. The partners agreed to give TR a preview of the substance behind their ambitious plan, and allowed a reporter to follow along as Celera’s facility came into being.

Much of the scientific expertise that powers Celera comes from The Institute for Genomic Research (TIGR), an independent lab Venter founded in 1992. At TIGR, also in Rockville, Venter’s staff has employed a rapid-fire method known as the “random shotgun” approach to decode the genomes of nearly a dozen bacteria. No other lab has produced more DNA sequence, the “readings” of the long strings of chemical letters designated A, C, G and T that make up the DNA molecule. Then again, Venter’s approach has never been tried on anything as large as the human genome, which contains about 1,000 times as much DNA as your average microbe. “It’s hard…to grasp the entire scale of this,” says Venter, now Celera’s president. “I can deal with millions, at least, because I spend them all the time now.”

The money behind Celera comes from Perkin-Elmer, an instrument giant for which the project is a dramatic shift toward controlling data rather than just making and selling equipment. The decision by officials at Perkin-Elmer’s Norwalk, Conn., headquarters to put a powerful new type of DNA sequencer to work for themselves has stunned the biotechnology industry and drawn comparisons to Microsoft’s move into online publishing. The partners pre-empted fears that they might hijack the genome by promising to hand over the data for free (but with a few caveats) to the public sector. Between Venter’s shotgun method and Perkin-Elmer’s deep pockets and new machines, Celera looks as if it could well live up to its name: a play on the word celerity, for rapidity of action.

Factory Tour

Today, four months and many late nights after the anxiety-riven planning meeting, the gene factory is complete. The only signs that the glass building contains what is arguably the world’s most prolific molecular biology lab are two massive air conditioning units crouching in the grass. The chillers, too heavy to sit on top of the building, cool 1,600 cubic meters of air per minute and pipe it into the heart of the facility, where 257 new sequencing machines hum in orderly rows.

The gray, waist-high 3700-model machines were developed over two years in near-secrecy by Perkin-Elmer’s West Coast subsidiary, Applied Biosystems. Just one of those machines, says Venter, has more sequencing capacity than many big academic labs, most of which rely on an earlier model called the 377. Altogether, Venter calculates, Celera can decode nearly as much DNA in one day as all the major labs funded by the Human Genome Project produced last year.

It’s what’s inside these new machines that makes them so fast. Each contains 104 glass capillaries: hollow, hair-thin tubes that the machine can automatically fill with a syrupy polymer and later clean out with a dilute solution of nitric acid. The sequencer’s job is to sort DNA fragments by size. Pulled along by an electric field, small fragments move through the tubes faster than large ones. The capillaries replace cumbersome cafeteria-tray-sized slabs of toxic gel used in previous models, which had to be changed by a skilled technician every few hours. Stocked with chemicals and more than 1,000 DNA samples, the automated 3700 can run for nearly two days without human intervention, says Mark Adams, the young scientist who supervises Celera’s sequencing operation. At full capacity, Celera expects to read 100 million letters of DNA sequence each day.

More than half of Celera’s personnel, backed by eight 6-foot, 64-bit computer servers located in an adjacent building, will be devoted to unscrambling the avalanche of data streaming from the sequencing facilities. Leading the analysis is Gene Myers, an expert on pattern analysis on leave from the University of Arizona’s computer science department.

The challenge Myers’ staff will face is something like reassembling a complete Bible from 10 copies that have been torn into tiny pieces. Since the sequencing machines can read only short stretches of DNA, the genome must first be broken into smaller pieces. Celera scientists began by taking DNA from a number of human cells and chemically shredding it into millions of random, overlapping fragments a few thousand letters long. To keep a library of these fragments, the scientists grafted them into colonies of E. coli bacteria. Following the shotgun strategy, Celera will then sequence 500 letters from each end of a fragment; repeating the process across the entire library yields 70 million separate sequences.
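The “10 copies” in the Bible analogy can be checked against the article’s own figures. The short calculation below is only a back-of-the-envelope sketch: the numbers are the ones quoted in this story (a 3.5-billion-letter genome, 70 million sequences of 500 letters, 100 million letters a day at full capacity), while the function and variable names are illustrative and assume every read comes out at full length.

```python
# Back-of-the-envelope check using figures quoted in the article.
# All names here are illustrative, not Celera's.
GENOME_LETTERS = 3_500_000_000   # chemical letters in the human genome
READS = 70_000_000               # separate sequences across the library
READ_LENGTH = 500                # letters read from each end of a fragment
LETTERS_PER_DAY = 100_000_000    # stated throughput at full capacity


def coverage(reads, read_length, genome_letters):
    """Average number of times each letter of the genome gets read."""
    return reads * read_length / genome_letters


def days_to_sequence(reads, read_length, letters_per_day):
    """Days of full-capacity sequencing needed to produce all the reads."""
    return reads * read_length / letters_per_day


fragments = READS // 2  # two reads (one per end) for each fragment

print(fragments)                                             # 35000000
print(coverage(READS, READ_LENGTH, GENOME_LETTERS))          # 10.0
print(days_to_sequence(READS, READ_LENGTH, LETTERS_PER_DAY)) # 350.0
```

On these assumptions each letter is read about 10 times over, and the raw reading alone would occupy roughly a year of the factory running flat out.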

Myers’ task is to develop algorithms that can assemble these elements once their code has been read. Although it sounds like a straightforward job (just line up overlapping letters and start pasting), it is anything but. Take the ripped-up Bible. Common phrases such as “Thou shalt not…” or “Blessed are they…” would make reassembling the good book much harder because some fragments would appear to overlap when, in fact, they don’t. The genome is similarly crammed with repeated sequences, some short, some long, some present in a million copies, others repeated only twice.
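The trouble with repeated phrases can be seen in a toy version of the “line up overlapping letters” step. The sketch below is not Celera’s algorithm, which the company has not described here; it is a minimal suffix-to-prefix overlap check, with hypothetical function names and scraps of text standing in for DNA reads.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 if the longest match is shorter than `min_len`.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def successors(read, reads, min_len=3):
    """Every other read that could be pasted after `read`, with its overlap."""
    found = {}
    for other in reads:
        if other == read:
            continue
        size = overlap(read, other, min_len)
        if size > 0:
            found[other] = size
    return found


# Three scraps from a torn-up page; "shalt not" is the repeated phrase.
scraps = ["thou shalt not", "shalt not kill", "shalt not steal"]

print(successors("thou shalt not", scraps))
# {'shalt not kill': 9, 'shalt not steal': 9}
```

Both candidates overlap the first scrap equally well, so overlap alone cannot say which one truly follows. Multiply that ambiguity by repeats present in up to a million copies and the scale of the problem becomes clear.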

For that reason, scientists working on the publicly funded Human Genome Project have laboriously mapped out the genome before starting to sequence. The mapping is roughly like figuring out where the Bible’s chapters go before tearing up the pages: the scientists then have to reassemble many small piles, rather than one huge one. Elbert Branscomb, director of the Department of Energy’s Joint Genome Institute, thinks Celera’s 70-million-piece puzzle may be unsolvable. “How much of a problem this will be no one even has a moderately good guess,” says Branscomb.

Myers contends that the key to the solution is that Celera’s puzzle pieces come in pairs lifted from the ends of a single fragment, the total length of which they know. The pairs, he believes, will constrain the problem enough to arrive at a unique solution. Outside scientists say Celera’s strategy would be impossible without the sequences already developed at publicly funded labs, but Myers maintains the puzzle could be solved anyway. “Outside information is just an expedient,” he says. “If we were going to do a genome that we have no data about, say Bermuda grass, we could do a self-contained operation.”
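How paired ends might narrow the possibilities can be illustrated with a small sketch. It is a simplification under stated assumptions, not Myers’ actual method: positions are plain integers along a candidate layout, read orientation is ignored, and the function name, the 2,000-letter fragment length and the 200-letter tolerance are all hypothetical values chosen for the example.

```python
def consistent_placements(left_positions, right_positions,
                          fragment_length, read_length, tolerance):
    """Keep only the placements of a read pair that fit the known fragment size.

    `left_positions` and `right_positions` are candidate start positions for
    the two end reads of one fragment. A pair is kept when the distance from
    the start of the left read to the end of the right read is within
    `tolerance` of `fragment_length`.
    """
    matches = []
    for left in left_positions:
        for right in right_positions:
            span = right + read_length - left
            if abs(span - fragment_length) <= tolerance:
                matches.append((left, right))
    return matches


# Hypothetical case: the left read lands in unique sequence at 1,000, but the
# right read sits in a repeat that appears at three places in the layout.
left_candidates = [1_000]
right_candidates = [2_500, 40_000, 90_000]

print(consistent_placements(left_candidates, right_candidates,
                            fragment_length=2_000, read_length=500,
                            tolerance=200))
# [(1000, 2500)]
```

Only one copy of the repeat lies the right distance from its partner, so the known fragment length rules out the other two placements.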

Whether or not Celera’s operation represents top-notch science is still a matter of some debate in the genome community. Without a doubt, Celera’s version of the genome will have many, many small gaps: a photocopy, if you will, that gives the big picture and most of the detail but may fall short of the high-fidelity standard envisioned by the Human Genome Project.

After the Genome

The data, however, will be good enough to take to market. Venter has said he will give away the raw sequence for free by depositing it in the online public repository known as GenBank. So what’s left to sell? Quite a lot. Celera’s profits may come largely from licensing to pharmaceutical companies a database that packages the sequence into a more accessible form. Drug companies will mine the data for genes with medical applications, although Venter says Celera will first find and patent several hundred genes for itself. Celera will also hold onto information about single DNA letters that vary between people, known as “single nucleotide polymorphisms.” These differences may predict a person’s susceptibility to disease or to toxic drug reactions. And beyond the human genome lie others. Monsanto, the large agricultural concern, has already suggested that Celera take on the rice genome.

As Venter likes to point out, finishing the human sequence is simply the beginning of a new era in which the data can be put to use to improve human health. If Celera’s plans work out, this “post-genomic” epoch will be upon us sooner than anticipated. In fact, Celera advanced the timeline for reading the genome before a single wall had been knocked down for the factory’s renovation. Reacting to the unexpected competition, directors of the publicly funded Human Genome Project have announced that they now plan to knock off the entire project by 2003, two years earlier than the original schedule called for. And by 2001, when Celera has promised to unveil its data, public-sector scientists have vowed to come through with a “working draft” to match it. Public genome or private, celerity is definitely the order of the day.
