The Protein Grid

Cracking the human genome was just the beginning. Using two major community computing grids, researchers are taking a distributed approach to mapping the structure of human proteins.

Karen Epper Hoffman

February 1, 2005

The Human Genome Project gave researchers an important initial roadmap to the human gene sequence, but it’s a map that might prove tough to navigate, given that the function and structure of most of the proteins that do the work for those genes remain a mystery.

That’s why the Human Proteome Folding Project – a recent collaboration between IBM, United Devices, the Institute for Systems Biology and the University of Washington – is picking up where the Human Genome Project left off. Understanding the form and function of these proteins, which are at the core of many diseases and the natural target for many treatment drugs, will ultimately put researchers that much closer to understanding why certain diseases happen and how to treat and cure them.

“The Human Genome Project is the foundation on which this project sits,” says Dr. Rich Bonneau, senior scientist for the Institute for Systems Biology, the Seattle, Washington non-profit research institute that is spearheading the biology research effort for the Human Proteome Folding Project.

But running the computations necessary to create such a catalog could take literally a million years on a state-of-the-art PC – 50 years if you used a substantially more powerful commercial 1,000-node cluster computer. By ‘borrowing’ unused computing cycles from volunteers who download a program to their PCs – à la SETI@Home – researchers believe they can get at least a rough sketch of more than 100,000 proteins before the end of this year.

Tens of thousands of people have already downloaded the program through IBM’s and United Devices’ grids, both of which are being used for this program, and organizers hope that number will quickly reach into the millions.

The Beginning of a Beautiful Friendship

The Institute for Systems Biology (ISB) was approached about 18 months ago by United Devices, an Austin technology firm that makes software for grid computing. The software company had a vested interest in life science-related projects as it counts five of the world’s six biggest pharmaceutical companies as customers, and has been running its own distributed community grid since April 2001.

As with the well-known SETI@Home project, computer users can go online to United Devices’ site at www.grid.org, and download a client that runs computations on their computer’s unused cycles. Using the huge computational power of its own grid, United Devices has run previous projects looking into anti-viral leads for smallpox, and screening molecules to find treatments for anthrax.

When United Devices approached the Institute for Systems Biology, says Ed Hubbard, the software company’s president and founder, he was looking for “problems that they might solve” by harnessing the grid’s computing strength.

And, as it turned out, mapping human protein structures was a good fit.

Researchers at the Institute were already using Rosetta, a software package developed at the University of Washington to predict the structure of proteins. But while Rosetta offers a good forecast for what these proteins might look like, its predictions are not infallible. Researchers still need to run countless computations to determine the accuracy of a protein’s three-dimensional structure.

That’s where the grid comes in handy.

“I had the idea for some time that Rosetta would be the ideal application for the grid,” Bonneau says.

The United Devices grid, with more than three million participants, already delivers about a petaflop of computing power – meaning the system can run 1,000 trillion calculations per second, making the grid about as powerful as all the supercomputers in the world combined.
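To put those figures in perspective, the short Python sketch below works out what one petaflop spread across three million participants implies for the average machine. It uses only the two round numbers quoted above; the function name is illustrative, and because the article says "more than" three million participants, the result is best read as a rough upper bound on the average rather than a measured value.

```python
# Figures quoted in the article; both are approximate.
GRID_CALCS_PER_SECOND = 1_000 * 10**12   # about one petaflop = 1,000 trillion per second
PARTICIPANTS = 3_000_000                 # "more than three million" participants


def average_per_participant(total_per_second: int, participants: int) -> float:
    """Average calculations per second contributed by each participant."""
    return total_per_second / participants


avg = average_per_participant(GRID_CALCS_PER_SECOND, PARTICIPANTS)
print(f"about {avg / 1e6:.0f} million calculations per second per participant")
```

The point of the arithmetic is that no single volunteer machine needs to be fast; the grid's power comes from the number of machines contributing idle time.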

Indeed, Bonneau has already had an early insight into how much the grid could speed up the Institute’s work: He says his team had been working with Rosetta for two years to get halfway through predicting the structure of a yeast genome – a project they were able to finish up on the grid in two weeks.
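As a back-of-the-envelope check, the sketch below estimates the speedup implied by Bonneau's yeast example. It assumes, purely for illustration, that the two halves of the project involved equal amounts of computation and that "finish up" refers to the remaining half; neither assumption is stated in the article, so the result is an order-of-magnitude estimate only.

```python
WEEKS_PER_YEAR = 52


def speedup(baseline_weeks: float, grid_weeks: float) -> float:
    """Ratio of time taken without the grid to time taken on the grid."""
    return baseline_weeks / grid_weeks


# Assumption: each half of the yeast project is the same amount of work.
first_half_weeks = 2 * WEEKS_PER_YEAR   # two years on the team's own hardware
second_half_weeks = 2                   # two weeks on the grid

estimate = speedup(first_half_weeks, second_half_weeks)
print(f"roughly {estimate:.0f}x faster on the grid")
```

Under those assumptions the grid completed a comparable workload about fifty times faster than the team's earlier setup.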

Two Grids, No Waiting

Together, the ISB and United Devices brought in IBM – with whom United Devices had worked on its previous grid-based smallpox project. IBM was in the process of setting up its own World Community Grid, which it officially launched in November 2004.

The Human Proteome Folding Project became its first effort – and indeed the first time grid computing has been used for a biology-related project.

Stan Litow, president of IBM’s International Foundation, which helps oversee the World Community Grid, says about 70,000 people have downloaded the software since its launch, and he expects to have at least one million participants by year’s end.

“It was so clear that the repetitive calculations [of the protein project] needed the grid,” Litow says. “And this was not a narrow project – it has ramifications across a variety of research areas.”

With two grids at work, Bonneau says that within six months he hopes to have the database populated with upwards of 100,000 protein “domain” structures – a “low-resolution” shot of what the protein looks like at the architecture or fold level. Biologists and medical researchers can, in turn, use that data to get a better (if still not exact) idea of what proteins look like and how they interact.

Bonneau expressed hope that success in the Human Proteome Folding Project will lead to a second phase, where the grid is used to model a few key proteins to a higher level of detail. Modeling protein structures down to their atomic detail would give researchers more to work with, but would also be even more computationally intense.

But for now, the Human Proteome Folding Project is a necessary next step to better understanding why our bodies do what they do.

“I really like the idea that this project will have usable, practical results,” Bonneau says. “A lot of distributed computing projects don’t produce results that people can relate to.”
