Proteins on Demand

The director of Harvard’s Institute of Proteomics wants to create a DNA repository containing a template of every human protein. One goal: accelerated drug discovery.

Technology Review

October 4, 2001

Joshua LaBaer, head of the Harvard Institute of Proteomics, wants to do for proteomics what the Human Genome Project did for genetics: build an infrastructure that will provide raw material for research labs and biotech companies around the world. LaBaer’s dream is to construct a DNA repository containing a template of every human protein. The first templates could start rolling off the production line as early as next year. Technologyreview.com staff editor David Cameron caught up with LaBaer for a rundown of his plans.

TR: What is the purpose of your DNA repository?

LaBaer: Well, you can think of proteomics in two main categories. First is protein profiling, taking tissue specimens and trying to identify and characterize all the proteins. Lots of companies are doing this. Their ultimate goal is to find out the difference between the diseased and the normal tissues. That’s a hot area.

The other category, far less evolved, is functional proteomics. Here the goal is simply to make proteins and figure out what they do. To me, this is the most important element in biology, because in drug discovery you have to know what a particular protein does in order to understand how to block its activity and thus treat disease.

Functional proteomics is less evolved because no one has the genes in hand to express these proteins and figure out what they do. And that’s what this gene repository will provide. You understand, of course, that genes make proteins.

[Figure: Protein Templates]

TR: How did the project get started?

LaBaer: It started when some of us who study mammalian cells and human cells became very jealous of those who study yeast and bacteria. We decided that we’d like to do in mammalian cells what people have been doing for years in simpler organisms: for example, [creating] mutations and examining their effects on cell division. This is easy in yeast because yeast has 6,000 genes, all organized very simply. There’s a gene and there’s another gene and another gene, and so on and so forth. People could study the effects of proteins in yeast for decades.

TR: Not so with humans?

LaBaer: Right, and that’s what made the rest of us very jealous. You see, people have a lot more genes. We can argue over exactly how many, but it’s more than yeast. And the structure is more complex. Not only are the subsequent proteins more complex, but there can be many different proteins per gene.

One gene, one protein is the old model. That’s true for bacteria and yeast, and somewhat true for the worm. But when you get to the human, all bets are off. There can be many different versions of that protein, based on one gene. And that’s because in people the chromosome structure is far more complicated. In humans, you have a piece of a gene, then some distance, then another piece of this gene, then some distance, and so on. You have to make a copy of this whole stretch of DNA and then cut out what you don’t want and splice it together again.

TR: Is that possible?

LaBaer: Well, there are some tricks to this. Basically, a gene produces RNA, which then serves as a template for making protein. The particular type of RNA that contains the protein template is called messenger RNA, also known as mRNA. If you can get hold of that mRNA and backtrack its sequence, then you can say, ‘Okay, this piece of it maps to this part of the gene, and this piece maps to this part,’ and so forth. But we’ve discovered that there’s often more than one mRNA per gene. So, with such a complicated genomic structure, you can’t easily predict the sequence of human proteins the way you can in yeast (see “Protein Templates” image above).
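The cut-and-splice step LaBaer describes, and the way one gene can yield more than one mRNA, can be sketched in a few lines of Python. This is a toy illustration only: the sequence, the exon coordinates, and the `splice` helper are invented for the example and do not represent any real gene or tool.

```python
# Toy gene: uppercase runs stand for the gene pieces that are kept,
# lowercase runs for the "distance" between them. The sequence is invented.
GENE = "ATGGCC" + "gtaagt" + "TTTGGA" + "gtcagt" + "CCCTAA"

# Hypothetical (start, end) coordinates of the three kept pieces.
EXONS = [(0, 6), (12, 18), (24, 30)]

def splice(gene, exons):
    """Join the selected pieces in order, discarding everything between them."""
    return "".join(gene[start:end] for start, end in exons)

# One gene, two possible messages: all three pieces, or the middle one skipped.
full_message = splice(GENE, EXONS)
short_message = splice(GENE, [EXONS[0], EXONS[2]])

print(full_message)   # ATGGCCTTTGGACCCTAA
print(short_message)  # ATGGCCCCCTAA
```

Both messages come from the same stretch of DNA, which is why the genomic sequence alone does not tell you which protein templates a cell will actually produce.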

TR: You can’t analyze every single one of those mRNAs?

LaBaer: People have tried. [The result] is called a complementary DNA, or cDNA, library. This is done by converting mRNA into its DNA equivalent, which is more stable and easier to work with. But there are problems. When you collect mRNA from a cell you’re taking a snapshot of what is expressed in that cell at that moment. And there is a huge dynamic range in any given cell at any given time. Plus, some genes are expressed far more abundantly than others, and those tend to be the boring genes. The less abundant genes are the really interesting ones.

TR: So you could end up with a vast collection of data that doesn’t tell you much.

LaBaer: Think of a building. A building is like a cell. It’s the physical manifestation of a plan that started long ago. That plan was a blueprint. That blueprint is like the DNA of a cell; it never leaves the architect’s office. Now, to build that building, you’re going to photocopy certain pages and hand them to contractors. Electricians get the electrical layout, plumbers get the plumber’s layout, dry wallers get the dry-wall layout. The number of copies will be proportional to the number of times something needs to be built. So, while you’ll have multiple floor plans, you’ll just have a few elevator plans, and even fewer master electrical switchboard plans, but lots and lots of [plans for] doors and windows and dry walls.

So imagine you suddenly stop all construction and say, ‘Everyone, give me your photocopies now.’ Well, your collection would have lots of copies of doors and windows and somewhere buried in that stack would be the master electrical system that regulates the building.

That’s exactly what happens when you make a cDNA library. You stop the cell, collect all these mRNAs and then file through them. That’s why cDNA libraries aren’t a great way to discover what every protein does.
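A rough calculation shows why the rare "master switchboard" plans are so hard to find in a mixed stack. The sketch below assumes each clone is drawn independently from the library and that a given transcript makes up a fixed fraction of it; the fractions used and the `clones_needed` helper are illustrative assumptions, not figures from the interview.

```python
import math

def clones_needed(fraction, confidence=0.95):
    """Number of clones to screen so that a transcript making up `fraction`
    of the library shows up at least once with probability `confidence`.

    Derived from P(at least one hit in n draws) = 1 - (1 - fraction) ** n.
    """
    return math.ceil(math.log(1 - confidence) / math.log1p(-fraction))

# Assumed abundances: a common transcript versus a very rare one.
common = clones_needed(0.01)   # 1 in 100 clones
rare = clones_needed(1e-6)     # 1 in a million clones

print(common)
print(rare)
```

Under these assumptions the common transcript turns up within a few hundred clones, while the rare one takes on the order of millions.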

We realized that we needed an indexed cDNA repository of all genes, in a format that’s ready to be used to make proteins. One copy of each: effectively, a Noah’s Ark of genes.

TR: One copy of the floor plan, one copy of the elevator plan, one copy of the electrical plan?

LaBaer: Yes. By having that one copy of everything, you can test a gene once and not have to study it 100,000 times in order to get to that one other gene you’re interested in. In current cDNA libraries, you have to look at millions of events in order to get down to the rare copy. That means a lot of very boring experiments. But our gene repository makes it possible to do more sophisticated experiments.

Ultimately, we’d have this indexed array, so that I could say ‘Tube 1 has gene X, tube 2 has gene Y,’ and so on, so that you can do your experiment in a massive parallel mode and get your answers immediately.

The problem with a cDNA library is that everything’s mixed together, so if someone asks, ‘Did you test gene Q?’ you can say, ‘Well, I think it was in the library, so I think I tested it,’ but you can’t really be sure. But with an indexed array you can say, ‘I tested it, and here’s the response.’

That information can then go into a database, which is clearly where science has to go: doing multiple experiments, building that information into databases and gathering information about all the proteins in the proteome.
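The difference between a mixed library and an indexed array is essentially the difference between a pile and a lookup table. A minimal sketch, assuming a simple tube-to-gene index and an in-memory results store; the tube numbers, gene names, recorded response, and function names are all placeholders for illustration.

```python
# Hypothetical index: tube numbers and gene names are placeholders, not real data.
indexed_array = {1: "gene X", 2: "gene Y", 3: "gene Q"}

# Stand-in for the results database: gene name -> observed response.
results = {}

def record_result(tube, response):
    """Store the response observed in one tube under the gene that tube holds."""
    gene = indexed_array[tube]
    results[gene] = response
    return gene

def was_tested(gene):
    """True only if a response has actually been recorded for this gene."""
    return gene in results

record_result(3, "no effect on cell division")  # invented observation
print(was_tested("gene Q"), results["gene Q"])
print(was_tested("gene X"))
```

Because every tube maps to a known gene, the question "Did you test gene Q?" has a definite answer and an attached response, and each answer can go straight into a database.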

TR: How many of these made-to-order genes will there be?

LaBaer: Well, as a starting point, the consortium will want to do one mRNA representing each gene. Now, in my dreams I’d like to see every mRNA that we know of in the repository. But in practical terms, that’s expensive.

TR: Is there anyone else doing it? Any companies that will want to make this knowledge proprietary? Because what you’re doing will be in the public domain.

LaBaer: That’s right. Now, keep in mind that we currently do have access to lots of pieces of genes (which is what DNA chips are made out of: gene fragments that are repeated and arrayed on a surface), but we don’t have a library of whole, full-length genes, and I don’t know of any companies making a concerted effort to go after the entire canon of human genes in the way that we are. Basically, it’s an expensive undertaking, and very difficult to do right. That’s why it really needs to be a consortium.

Another reason why this needs to be a public effort is because this is the physical manifestation of the human genome. That’s far too precious to keep behind a commercial wall. The Human Genome Project needed to be a public project because it was so fundamental to biology and needed to be available to everybody. The same is true of this collection.

Obviously, people have been studying proteins for a long time, and it’s frustrating to hear people talk as if protein biology were something new. The problem is that they study proteins one or just a few at a time. Now we’re thinking: Can we study them 1,000 at a time? Can we express genes in cells like David Sabatini does, thousands at a time? All these high-throughput platforms are going to significantly increase the pace at which we do biology. That’s where we’re all headed.
