If the Earth Simulator really were viewed as another Sputnik, right about now the U.S. government would be budgeting some serious cash for supercomputing research and development. After all, NASA spent more than $19 billion (roughly $80 billion in today's dollars) on the Apollo missions to put a man on the moon. But George W. Bush is no John F. Kennedy, and the race to overtake the Earth Simulator has captured neither the public's imagination nor a corresponding level of public funding.

There is, however, one recurring driver for U.S. spending on supercomputing: the need for more computational power to simulate nuclear weapons performance in place of underground testing. In November the Energy Department's National Nuclear Security Administration awarded IBM a three-year, $267 million contract to build two supercomputers, ASCI Purple and Blue Gene/L, which are projected to have more combined processing power than today's 500 fastest supercomputers put together. And in response to the Earth Simulator, the U.S. Defense Advanced Research Projects Agency has begun its own, more modest program to fund supercomputer research. The agency is starting with individual $3 million research grants to industry leaders Cray, IBM, Hewlett-Packard, Silicon Graphics, and Sun Microsystems, to be followed by additional funds if technology milestones are met. The new defense projects have begun to reenergize the industry. Cray's Burton Smith says, "This really marks the end of a fairly long period where the government hasn't been involved in computer research and development."
Another way to jump-start U.S. supercomputing might be through additional investment in vector computing, an approach the Earth Simulator proves was prematurely abandoned by U.S. developers. “The current state of high-performance architecture goes back to the robustness of the 25-year-old basic Cray vector architecture that NEC adopted and continues to improve,” says Bell. “United States architects have rejected the architecture while failing to develop competitive alternatives.”
But where will new vector computers be developed? Just as the United States once fell behind in consumer electronics, it also rapidly lost the expertise needed to build such systems. Out of the 40-odd companies that specialized in high-performance computing in the late 1980s, only Cray remains in business, having survived acquisition by Silicon Graphics in 1996 and Tera in 2000. (Tera promptly changed its own name to Cray in acknowledgment of the company’s singular prominence in the field.) Cray is the only U.S. supercomputer maker to support vector processing. Last fall Cray started shipping a new vector computer, the X1; fully loaded, the machine will outpace the Earth Simulator by almost 50 percent, the company says. But Cray has not yet sold an X1 system that powerful, and while the U.S. Army and the Department of Energy are evaluating its potential, other customers are understandably wary of depending on an architecture that is supported by only one relatively small company.
All this leaves U.S. supercomputing hopes pinned for the most part on entirely new architectures, ones that may have the potential to outperform vector computing. Although DARPA's grants were initiated in response to the Earth Simulator, they are meant to achieve a fundamentally different goal: to make supercomputers that are not only faster, but also cheaper to build and use than anything previously developed. Is there a way to do this all at once? It's logical to think there should be, but the answer is not apparent. "We're starving in an era of plenty," says Bill Dally, professor of electrical engineering and computer science at Stanford University. "All the ingredients for computing (arithmetic, communications, memory) are getting cheaper, and for almost anything besides [high-performance] computing, costs per unit get cheaper as you scale."
As with any new architecture, the greatest challenge will be the memory bottleneck, where data’s comparatively slow trek to and from memory hampers processor efficiency. Even as it continues to pursue its own vector-computing systems, Cray is attacking that problem by adapting the interprocessor communications techniques used in the Earth Simulator to increase data flow in Red Storm, a massively parallel computer to be built at Sandia National Laboratories in New Mexico.
Dally believes he has developed another solution, which he calls “streaming.” While traditional computing architecture treats all arithmetic operations equally, the calculations in many scientific simulations build on themselves, with no need to store intermediary values in long-term memory. So rather than passing control from one instruction to another and accessing memory sequentially, Dally is building a system that streams data to processors, which then act on the problem locally through many intermediate calculations and stream the finished values back to memory. “Protein folding is an example where to understand how two molecules interact, you need to carry out perhaps 500 intermediate results before arriving at the one you want,” says Dally. “With streaming you can capture those intermediary results in local registers where the communications bandwidth is very inexpensive, and you never touch the memory system.”
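The idea can be sketched in software, though only loosely: a minimal Python illustration, in which local variables stand in for the on-chip registers of a real streaming processor, and the 500-step calculation is an invented toy, not an actual protein-folding kernel.

```python
def stream_kernel(x):
    """Toy 'streaming' kernel: every intermediate value stays in a
    local variable (the software stand-in for cheap local registers).
    Only the finished value is returned to be written back to memory."""
    acc = x
    for _ in range(500):             # hundreds of intermediate results...
        acc = acc * 1.000001 + 0.5   # ...none of which touch main memory
    return acc

# A conventional formulation would write each intermediate step to a
# shared array (i.e., main memory); here only inputs and final
# results cross the "memory" boundary.
results = [stream_kernel(v) for v in (1.0, 2.0, 3.0)]
```

The point of the sketch is the data-movement pattern, not the arithmetic: per input value, exactly one read and one write cross the memory boundary, however many intermediate calculations occur in between.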
Advances in microprocessor-manufacturing techniques are also making it possible to put processors and large amounts of memory on the same chip, shortening the distance that instructions and data must travel. IBM plans to try such processor-in-memory techniques in supercomputing, and DARPA is funding an effort at the University of Southern California to explore exactly how the technology can improve high-performance systems; the USC researchers are working with Hewlett-Packard to deliver an experimental system to DARPA for evaluation. And a project at Caltech known as Gilgamesh is examining the best way to arrange memory and logic together on a chip: if small blocks of memory and logic were interspersed around a chip, for instance, travel time would continue to drop, enhancing performance.
Another option is simply to turn the entire architecture on its head. Cray’s Burton Smith and Caltech’s Sterling are cooperating on a DARPA-funded project they call Cascade. The two are investigating ways to exploit the fact that in high-end scientific computing, the heft of the data alone is often far greater than the heft of the application program. In other words, if the stuff stored in memory is so much larger than the program needed to run it, why move it at all? Why not move the program to the memory instead? They are willing to share their optimism, but Smith and Sterling say little else about their nascent architecture. “I’m purely subjective, but I think it’s the most exciting thing in high-performance computing in the last 20 years,” says Sterling.
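Since Smith and Sterling have disclosed little about Cascade, the following Python sketch illustrates only the generic "move the program to the data" principle the project builds on, not its actual design; the `MemoryBank` class and its contents are invented for illustration.

```python
# Hypothetical illustration: each MemoryBank owns a large block of
# data that never moves. Instead of shipping the data to a central
# processor, we ship a small piece of code (a kernel) to each bank
# and collect only the tiny per-bank results.

class MemoryBank:
    def __init__(self, data):
        self.data = data               # large data set, stays put

    def run(self, kernel):
        return kernel(self.data)       # small program travels instead

# Four banks holding 1,000 values each (toy scale).
banks = [MemoryBank(list(range(i * 1000, (i + 1) * 1000)))
         for i in range(4)]

# Only four numbers cross the interconnect, not 4,000 data values.
total = sum(bank.run(sum) for bank in banks)
```

The asymmetry the article describes is what makes this attractive: when the kernel is kilobytes and the data is terabytes, moving the former is vastly cheaper than moving the latter.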
These are the proposals eliciting the most interest at the government agencies likely to be buying the high-performance systems of the future. But for the most part they are prototypes, or even less, and they will prove themselves only after radical changes are made in everything from processor design to the way software is engineered. Then the challenge will remain to find a way to manufacture the new systems efficiently and without defects. In other words, these ideas offer no quick fix.
Current and Proposed Supercomputer Architectures
| Architectural Approach | Description | Advantages | Main Proponents |
| --- | --- | --- | --- |
| Commodity clusters (operational) | Hundreds or thousands of off-the-shelf servers with low-bandwidth links | Low-cost construction; efficient with problems that can be broken into chunks | Hewlett-Packard, IBM, Silicon Graphics |
| Vector computing (operational) | Hundreds of custom-built processors with high-bandwidth connections | More time spent computing, less time communicating | Cray, NEC |
| Streaming (experimental) | Intermediary values of calculations stored in local memory | Speed; on-chip data transfer for reducing the "memory bottleneck" | Stanford University |
| Processor-in-memory (experimental) | Processing circuits and short-term memory interspersed on the same chip | Speed; shorter distance between processors and memory | University of Southern California, Caltech, IBM |
| Cascade (experimental) | Data, rather than software, held in processor's local memory | Fewer calls to memory in cases where data sets are larger than programs | Cray, Caltech |