Supercomputing Resurrected

Last year, Japan fired up an ultrafast computer that puts its closest competitors to shame. What will it take for the United States to catch up?

Claire Tristram

February 1, 2003

Even in a field defined by continuous breakthroughs, the achievement was a shocker: last March the Japanese government fired up a computer that soon proved to be the fastest in the world, in some cases outperforming the next-fastest computer by a factor of 10. The Earth Simulator, built by NEC, took four years to assemble and cost at least $350 million. It quickly delivered real-world scientific results in global-climate modeling, completing simulations that made other computers look crude. Scientists worldwide lined up for the limited amount of computer time available to researchers outside Japan. By June, just weeks after the machine hummed to life, three of the six finalists for the prestigious Gordon Bell awards in high-performance computing had run their projects on the Earth Simulator.

A smattering of articles last spring covered the news, quoting experts who compared the Earth Simulator to Sputnik-another instance of the United States’ having been severely outclassed in a critical technology. But outside the rarefied circles of high-end computing, the story soon died. U.S. computer vendors have been downplaying the achievement, dismissing the Earth Simulator as “old technology” or “too specialized” to be of much use, even insisting that it was a “publicity stunt.” “Give us $400 million to spend on a single computer, and we could build something just as fast,” says Peter Ungaro, vice president of high-performance computing at IBM.

“I love that,” scoffs Gordon Bell, designer of the first minicomputer for Digital Equipment and a luminary in high-performance computing. “How is IBM going to do it? Where is the technology? I want to bet $1,000 that in the next year, IBM can’t match the cost performance of the Earth Simulator on any system they have.” In fact, IBM recently won a Department of Energy contract to build a pair of machines designed to run at two to nine times the speed of the Earth Simulator, but the project will take until 2005 to complete. Like many of those involved in high-powered scientific computing, Bell believes that Japan’s achievement has exposed a gaping hole in the development of supercomputer systems in the United States-a hole that money alone can’t fill.

What happened that allowed NEC to take such a tremendous lead in computing power? Simply put, the Japanese government saw fit to subsidize the development of the world’s most expensive computer. The project’s goal was not to grab bragging rights from the United States, but to advance scientists’ understanding of the global climate by creating a machine that performs better modeling and weather simulations than ever before.

At the same time, U.S. government funding for research on high-end computing was waning in response to the deeply felt U.S. notion that supercomputer developers-like welfare moms-should take care of themselves rather than survive on government handouts. Compared with any other part of the computer market, the market for supercomputers is small and slow growing, so when public funding dried up, private investment in high-performance architectures dried up too. For the past decade or so, the U.S. emphasis in supercomputing has therefore been on linking clusters of commodity processors-those designed for everyday business applications-in what are known as massively parallel configurations. That approach is a stark contrast to the Japanese vision of specialized architectures developed solely for the high-performance market.

Granted, the commodity approach has gone far: at this writing two commodity machines, the twin Hewlett-Packard-built ASCI Q supercomputers at Los Alamos National Laboratory in New Mexico, rank as second-fastest in the world (as measured by Top500.org, a nonprofit analysis group). The idea of harnessing many low-end processors to do complicated tasks has captured the public imagination as well, with projects such as SETI@home, which enlists the desktop computers of more than four million volunteers to scan radio telescope data for patterns indicative of alien intelligence. Beowulf clusters, which use a method developed in 1994 for linking PCs together to maximize their processing power, have made it even easier to reach high-performance levels with relatively low capital investment. Without question, the commodity approach has proved itself for many applications that at one time ran on specialized “big iron.”

But in spite of these gains, the United States has fallen painfully short in the very field where computing muscle matters most and where the nation has the most to gain: in simulating such complex systems as weather on the macroscopic end and protein folding on the microscopic. This simulation capability is increasingly vital for the advancement of basic science, as well as for national security.

Making the private sector pay for this capability is “like the defense industry’s saying nuclear submarines have to have some sort of commercial spinoff,” says Horst Simon, director of the National Energy Research Scientific Computing Center in Oakland, CA, home to the 12th-fastest computer. “We’ve embarked on a direction in the United States that is not going to work.”

The Need for Speed

What are the real advantages of making computers ever faster? Why, after all, can’t we use a machine that takes a month or a week to complete a task instead of a day or an hour? For many problems, we can. But the truth is, we’re just beginning to gain the computing power to understand what is going on in systems with thousands or millions of variables; even the fastest machines are just now revealing the promise of what’s to come.

Take, for instance, greenhouse gases and the way they affect the global climate, one of the problems the Earth Simulator was built to study. With computers fast enough to predict climate changes accurately, we can know with far greater certainty what level of atmospheric carbon dioxide will melt the polar ice caps. Similarly, because the Earth Simulator models the planet’s climate at an incredible degree of granularity, it can carry out simulations that account for the effects of such local phenomena as thunderstorms. These phenomena may affect areas only 10 kilometers wide-in contrast to the 30 to 50 kilometers most weather models use as the standard grid size.
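
The cost of that finer granularity can be sketched with simple arithmetic. The example below is a rough illustration, not a description of any real climate model: it assumes uniform square cells laid over a spherical Earth with a mean radius of 6,371 kilometers, and it counts only horizontal surface cells, ignoring vertical layers and the shorter time steps that finer grids usually require.

```python
import math

# Surface area of a sphere with Earth's mean radius (6,371 km).
EARTH_SURFACE_KM2 = 4 * math.pi * 6371.0 ** 2


def horizontal_cells(grid_spacing_km: float) -> int:
    """Approximate number of square surface cells of the given width.

    Assumes uniform square cells, which real models do not use exactly.
    """
    return round(EARTH_SURFACE_KM2 / grid_spacing_km ** 2)


for spacing in (50, 30, 10):
    cells = horizontal_cells(spacing)
    print(f"{spacing:>2} km grid: about {cells:,} surface cells")

# Shrinking the cell width by a factor of 5 multiplies the count by about 25.
growth = horizontal_cells(10) / horizontal_cells(50)
print(f"10 km grid has about {growth:.1f}x the cells of a 50 km grid")
```

Under these assumptions, the cell count grows with the square of the refinement factor, which is one reason a 10-kilometer grid demands so much more computing power than a 30-to-50-kilometer one.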

Or take the difficulties we’ve encountered trying to understand and harness nuclear fusion-that perpetually just-out-of-reach panacea for our energy problems. “It can take a decade to perform a single [fusion] experiment,” says Thomas Sterling, faculty associate at the Center for Advanced Computing Research at Caltech. “Faster computers would accelerate these projects by decades, allowing us not only to design safe reactors that give us the power to run the planet, but also to know how to get rid of the waste.”

One recent example of both the promise and the limitations of today’s most powerful computers came from IBM’s ASCI White machine, the world’s fourth-fastest supercomputer, which IBM researchers used to investigate how materials crack and deform under stress. The study, announced last spring, simulated the behavior of a billion copper atoms. A billion certainly sounds like a lot of variables-until you realize that it would take on the order of a hundred trillion times that number of atoms to make up even a cubic centimeter of copper.
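
The scale gap can be checked with a back-of-the-envelope calculation. The sketch below uses approximate handbook values for copper's density and molar mass, which are assumptions brought in for illustration and are not figures from the IBM study.

```python
AVOGADRO = 6.02214076e23       # atoms per mole
COPPER_DENSITY_G_CM3 = 8.96    # approximate density at room temperature
COPPER_MOLAR_MASS_G = 63.546   # grams per mole

# Moles in one cubic centimeter, converted to a count of atoms.
atoms_per_cm3 = COPPER_DENSITY_G_CM3 / COPPER_MOLAR_MASS_G * AVOGADRO

simulated_atoms = 1e9          # the billion atoms in the simulation
ratio = atoms_per_cm3 / simulated_atoms

print(f"atoms in 1 cm^3 of copper: {atoms_per_cm3:.2e}")
print(f"ratio to a billion-atom simulation: {ratio:.2e}")
```

With these inputs the ratio lands on the order of a hundred trillion, so even a billion-atom simulation covers only a vanishingly small speck of metal.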

“There’s a notion out there that high-performance computing is a mature industry, where all the problems have been solved, and we’ve moved on,” says Burton Smith, chief scientist at Cray, a pioneering supercomputer company in Seattle. “That is false. The embarrassment of the Earth Simulator reveals the fact that there is still plenty more understanding to be had.”

WHO MAKES THE MOST SUPERFAST COMPUTERS?
Company	Number in Top 500	Fastest Machine	Speed (Gigaflops)	Location
Hewlett-Packard	137	ASCI Q	7,727	Los Alamos National Laboratory, NM
IBM	129	ASCI White	7,226	Lawrence Livermore National Laboratory, CA
Sun Microsystems		HPC 4500	420	Swedish Armed Forces, Stockholm, Sweden
Silicon Graphics		ASCI Blue Mountain	1,608	Los Alamos National Laboratory, NM
Cray		T3E 1200	1,166	Unknown (U.S. government)
NEC		Earth Simulator	35,860	Earth Simulator Center, Yokohama, Japan

Custom versus Commodity

Over the last decade, everything we’ve heard about computers has been about making them smaller, faster, cheaper, more like commodities. Our laptops, for instance, have the same capabilities as a Cray computer from the mid-1970s. Then along comes the monstrous Earth Simulator: it’s the size of four tennis courts, and it cost almost twice as much as its closest competition, the ASCI Q machines. If this is the future of supercomputing, what are we to make of it? The machine doesn’t even boast a particularly new architecture: NEC used a technique called vector computing, which dates from Cray’s earliest days.

But beyond its ungainliness and architectural oddities, the Earth Simulator exemplifies an approach to high-performance computing that is fundamentally different from the one followed by most U.S. computer makers today. The Earth Simulator was designed from the bottom up-from its processors to the communications buses that link processors and memory-to be the world’s fastest computer. When will the U.S. approach of linking general-purpose processors like those that serve up Web pages produce a result that can match the performance of a machine explicitly designed for performance? “My view is that you can’t do it,” says Bell. “I simply don’t see a way, with a general purpose computer, of getting from there to here.”

The performance challenge begins with the processor. The data crunched in scientific computation often take the shape of lists of numbers, the values associated with real-world observations. Traditionally, computers acted on these values sequentially, retrieving them from memory one by one. Then, in the early 1970s, Seymour Cray took an intuitive leap: why not design a computer so that its processors can request an entire list, or “vector,” all at once, rather than waiting for memory to respond to each request in turn? Such a processor would spend more time computing and less time waiting for data from memory. From the mid-1970s through the 1980s, Cray’s vector supercomputers set record after record. But they required expensive specialized chips, and vector computing was, therefore, largely abandoned in the United States after 1990, when the notion of massively parallel systems made from off-the-shelf processors took hold.
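
Why requesting a whole list at once helps can be shown with a toy cost model. The sketch below is an illustration under stated assumptions, not a description of any Cray or NEC design: the 100-cycle memory latency, the one-cycle computation, and the 64-element vector length are hypothetical numbers, and the model ignores caches and pipelining details.

```python
def scalar_cycles(n_elements: int, memory_latency: int, compute_per_element: int) -> int:
    """Every element pays the full memory latency before it is computed on."""
    return n_elements * (memory_latency + compute_per_element)


def vector_cycles(n_elements: int, memory_latency: int,
                  compute_per_element: int, vector_length: int) -> int:
    """One memory request (and one latency) covers a whole vector of elements."""
    full, rest = divmod(n_elements, vector_length)
    requests = full + (1 if rest else 0)
    return requests * memory_latency + n_elements * compute_per_element


# Hypothetical workload: a million values, 100-cycle memory latency,
# 1 cycle of arithmetic per value, 64 values per vector request.
N, LATENCY, COMPUTE, VLEN = 1_000_000, 100, 1, 64

scalar = scalar_cycles(N, LATENCY, COMPUTE)
vector = vector_cycles(N, LATENCY, COMPUTE, VLEN)

print(f"scalar: {scalar:,} cycles, {N * COMPUTE / scalar:.1%} spent computing")
print(f"vector: {vector:,} cycles, {N * COMPUTE / vector:.1%} spent computing")
```

In this simplified model, the arithmetic is identical in both cases; only the time spent waiting on memory changes, which is the effect the vector approach is meant to exploit.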

Vector computing nevertheless remained one of the most efficient ways to handle large-scale simulations, prompting NEC to adopt Cray’s approach when it bid for the government contract to build the Earth Simulator. Once NEC’s architects decided to build for speed rather than standardization, they were free to develop not only specialized processors, but also wider communications pathways between the processors, compounding the hardware’s speed advantage. Many such improvements are built into the NEC SX-6, the fundamental building block of the Earth Simulator. “Vector architecture is the best fit for computer simulations of ‘grand challenge’ scientific and engineering problems such as global warming, supersonic-airplane design, and nanoscale physics,” says Makoto Tsukakoshi, a general manager for the Earth Simulator project at NEC.

Yoking together commodity machines with standard commercial networks, on the other hand, shifts the speed burden from hardware to software. Computer scientists must write “parallel programs” that parse problems into chunks, then explicitly control which processors should handle each chunk-all in an effort to minimize the time spent passing bits through communications bottlenecks between processors.
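
The bookkeeping this involves can be sketched in a few lines. The example below is a minimal, single-process simulation and not real parallel code: it splits a list of values among hypothetical "workers," hands each worker the boundary values it needs from its neighbors, and counts those exchanges as messages. The smoothing calculation and all function names are invented for illustration; a production program would use a message-passing library instead.

```python
def smooth_serial(values):
    """One smoothing step: each interior point becomes the mean of itself and its neighbors."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def split_into_chunks(n_values, n_workers):
    """Assign a contiguous index range [start, stop) to each worker."""
    base, extra = divmod(n_values, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def smooth_chunked(values, n_workers):
    """Same smoothing step, done chunk by chunk with explicit boundary exchange."""
    n = len(values)
    out = list(values)
    messages = 0
    for start, stop in split_into_chunks(n, n_workers):
        # Boundary values a real worker would have to receive from its neighbors.
        left = [values[start - 1]] if start > 0 else []
        right = [values[stop]] if stop < n else []
        messages += len(left) + len(right)
        local = left + list(values[start:stop]) + right
        offset = start - len(left)  # global index of local[0]
        for i in range(max(start, 1), min(stop, n - 1)):
            j = i - offset
            out[i] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
    return out, messages


data = [float(i * i % 7) for i in range(20)]
chunked, messages = smooth_chunked(data, n_workers=4)
print("matches serial result:", chunked == smooth_serial(data))
print("boundary messages for one step:", messages)
```

Even in this toy case, most of the code manages chunks and boundaries rather than the calculation itself, and the message count grows with the number of workers.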

Such programming has proved extremely difficult: a straightforward FORTRAN program becomes a noodly mess of code that calls for rewriting and debugging by parallel-programming specialists. “I hope to concentrate my attention on my research rather than on how to program,” says Hitoshi Sakagami, a researcher at Japan’s Himeji Institute of Technology and a Gordon Bell Prize finalist for work using the Earth Simulator. “I don’t consider parallel computers acceptable tools for my research if I’m constantly forced to code parallel programs.”

It’s not laziness that has kept programmers from finding better ways to write parallel code. “People have worked extremely hard trying to develop new application software based on different algorithms to use parallel machines, with little success,” says Jim Decker, principal deputy director of the Office of Science in the Department of Energy. (Decker’s agency is responsible for basic research in areas such as energy and the environment.) Vector machines often employ their own form of parallel processing, but the mathematics for doing so is far less complicated; Earth Simulator scientists, for example, are able to program using a flavor of the classic FORTRAN computer language that takes a much more direct approach.

A supercomputer comprising large numbers of commercial processors isn’t just hard to program. It has become clear that the gains from adding more processors to a commodity system eventually flatten into insignificance as coaxing them to work together grows more difficult. What really got computational scientists’ hearts racing about the Earth Simulator was not its peak speed (the maximum number of calculations it can perform per second), which is roughly four times that of the next-fastest machine and is impressive enough in itself. Instead, it was the computer’s capability for real problem solving (which, after all, is what scientists care about). The Earth Simulator can crunch computations at up to 67 percent of its peak capacity over a sustained period. In comparison, the massively parallel approach-well, it doesn’t compare.
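
The flattening can be illustrated with a standard textbook model, Amdahl's law, which is brought in here for illustration and is not cited in the reporting above. The sketch assumes that 99 percent of a program can run in parallel and adds a small, hypothetical coordination cost for every extra processor; both numbers are made up.

```python
def speedup(processors: int, parallel_fraction: float = 0.99,
            overhead_per_processor: float = 0.0) -> float:
    """Amdahl-style speedup with an optional linear coordination cost.

    Time is measured relative to a single-processor run taking 1.0 units.
    """
    serial = 1.0 - parallel_fraction
    time = (serial
            + parallel_fraction / processors
            + overhead_per_processor * (processors - 1))
    return 1.0 / time


for p in (1, 16, 256, 4096, 65536):
    ideal = speedup(p)
    with_overhead = speedup(p, overhead_per_processor=1e-5)
    print(f"{p:>6} processors: {ideal:6.1f}x without overhead, "
          f"{with_overhead:6.1f}x with coordination cost")
```

Under these assumptions the speedup can never exceed 100, however many processors are added, and once coordination costs are included it eventually falls as the machine grows.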

“If you throw enough of these commodity processors into a system, and you’re not overwhelmed by the cost of the communications network to link them together, then you might eventually reach the peak performance of the Earth Simulator,” says Sterling. “But what is rarely reported publicly about these systems is that their sustained performance is frequently below five percent of peak, or even one percent of peak.”
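
The gap between peak and sustained performance is easy to quantify using only the percentages quoted above. The sketch below normalizes the Earth Simulator's peak to 1.0, since its absolute peak figure is not needed for the comparison; the function name is illustrative.

```python
def required_peak(target_sustained: float, efficiency: float) -> float:
    """Peak capacity needed to deliver a sustained rate at a given efficiency."""
    return target_sustained / efficiency


# Normalize the Earth Simulator's peak to 1.0 and apply its quoted 67 percent.
reference_sustained = 1.0 * 0.67

for efficiency in (0.05, 0.01):
    multiple = required_peak(reference_sustained, efficiency)
    print(f"at {efficiency:.0%} of peak, a cluster needs about "
          f"{multiple:.1f}x the Earth Simulator's peak to match it")
```

By this arithmetic, a system that sustains 5 percent of peak needs more than 13 times the Earth Simulator's peak capacity to match its sustained output, and one that sustains 1 percent needs 67 times.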

Although it’s certainly cheaper to build supercomputers out of commodity parts, many computational scientists suspect that the cost of developing parallel software actually makes it more expensive to run scientific applications on such a machine.

“People have gotten enamored of the low cost for what sounds like a very high level of performance on commodity machines,” says Decker. “But they aren’t really cheaper to build. We need to look at sustained performance, as well as the cost of developing software. Software costs are generally larger than hardware costs, so if there are hardware approaches that make it easy to solve the problem, we’re better off investing in hardware. In hindsight, I believe we would have been better off taking a different path.”

Playing Catch-Up

If the Earth Simulator really were viewed as another Sputnik, right about now the U.S. government would be budgeting some serious cash for supercomputing research and development. After all, NASA spent more than $19 billion-roughly $80 billion in today’s dollars-on the Apollo missions to put a man on the moon. But George W. Bush is no John F. Kennedy, and the race to overtake the Earth Simulator has captured neither the public’s imagination nor a corresponding level of public funding.

There is, however, one recurring driver for U.S. spending on supercomputing-the need for more computational power to simulate nuclear weapons performance in place of underground testing. In November the Energy Department’s National Nuclear Security Administration awarded IBM a three-year, $267 million contract to build two supercomputers-ASCI Purple and Blue Gene/L-which are projected to have more combined processing power than today’s 500 fastest supercomputers put together. And in response to the Earth Simulator, the U.S. Defense Advanced Research Projects Agency has begun its own, more modest program to fund supercomputer research. The agency is starting with individual $3 million research grants to industry leaders Cray, IBM, Hewlett-Packard, Silicon Graphics, and Sun Microsystems, to be followed by additional funds if technology milestones are met. The new defense projects have begun to reenergize the industry. Cray’s Burton Smith says, “This really marks the end of a fairly long period where the government hasn’t been involved in computer research and development.”

Another way to jump-start U.S. supercomputing might be through additional investment in vector computing, an approach the Earth Simulator proves was prematurely abandoned by U.S. developers. “The current state of high-performance architecture goes back to the robustness of the 25-year-old basic Cray vector architecture that NEC adopted and continues to improve,” says Bell. “United States architects have rejected the architecture while failing to develop competitive alternatives.”

But where will new vector computers be developed? Just as the United States once fell behind in consumer electronics, it also rapidly lost the expertise needed to build such systems. Out of the 40-odd companies that specialized in high-performance computing in the late 1980s, only Cray remains in business, having survived acquisition by Silicon Graphics in 1996 and Tera in 2000. (Tera promptly changed its own name to Cray in acknowledgment of the company’s singular prominence in the field.) Cray is the only U.S. supercomputer maker to support vector processing. Last fall Cray started shipping a new vector computer, the X1; fully loaded, the machine will outpace the Earth Simulator by almost 50 percent, the company says. But Cray has not yet sold an X1 system that powerful, and while the U.S. Army and the Department of Energy are evaluating its potential, other customers are understandably wary of depending on an architecture that is supported by only one relatively small company.

All this leaves U.S. supercomputing hopes pinned for the most part on entirely new architectures-those that may have the potential to outperform vector computing. Although DARPA’s grants were initiated in response to the Earth Simulator, they are meant to achieve a fundamentally different goal: to make supercomputers that are not only faster, but also cheaper to build and use than anything previously developed. Is there a way to do this all at once? It’s logical to think there should be, but the answer is not apparent. “We’re starving in an era of plenty,” says Bill Dally, professor of electrical engineering and computer science at Stanford University. “All the ingredients for computing-arithmetic, communications, memory-are getting cheaper, and for almost anything besides [high-performance] computing, costs per unit get cheaper as you scale.”

As with any new architecture, the greatest challenge will be the memory bottleneck, where data’s comparatively slow trek to and from memory hampers processor efficiency. Even as it continues to pursue its own vector-computing systems, Cray is attacking that problem by adapting the interprocessor communications techniques used in the Earth Simulator to increase data flow in Red Storm, a massively parallel computer to be built at Sandia National Laboratories in New Mexico.

Dally believes he has developed another solution, which he calls “streaming.” While traditional computing architecture treats all arithmetic operations equally, the calculations in many scientific simulations build on themselves, with no need to store intermediary values in long-term memory. So rather than passing control from one instruction to another and accessing memory sequentially, Dally is building a system that streams data to processors, which then act on the problem locally through many intermediate calculations and stream the finished values back to memory. “Protein folding is an example where to understand how two molecules interact, you need to carry out perhaps 500 intermediate results before arriving at the one you want,” says Dally. “With streaming you can capture those intermediary results in local registers where the communications bandwidth is very inexpensive, and you never touch the memory system.”
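
The saving Dally describes can be illustrated by counting memory accesses. The sketch below is a toy accounting exercise and not a model of his actual hardware: the `CountingMemory` class, the `stage` calculation, and the sizes are all hypothetical. It runs the same 500-step chain of calculations two ways, once writing every intermediate value back to memory and once keeping intermediates in a local variable that stands in for on-chip registers.

```python
class CountingMemory:
    """A list that counts how often it is read or written."""

    def __init__(self, data):
        self.cells = list(data)
        self.reads = 0
        self.writes = 0

    def load(self, index):
        self.reads += 1
        return self.cells[index]

    def store(self, index, value):
        self.writes += 1
        self.cells[index] = value


def stage(x):
    """Hypothetical intermediate calculation, applied many times in a row."""
    return 0.5 * x + 1.0


def run_through_memory(mem, n, stages):
    """Conventional style: every intermediate result goes back to memory."""
    for _ in range(stages):
        for i in range(n):
            mem.store(i, stage(mem.load(i)))


def run_streamed(mem, n, stages):
    """Streaming style: load once, keep intermediates local, store once."""
    for i in range(n):
        local = mem.load(i)
        for _ in range(stages):
            local = stage(local)
        mem.store(i, local)


inputs = [float(i) for i in range(8)]
through, streamed = CountingMemory(inputs), CountingMemory(inputs)
run_through_memory(through, len(inputs), stages=500)
run_streamed(streamed, len(inputs), stages=500)

print("same results:", through.cells == streamed.cells)
print("memory accesses, conventional:", through.reads + through.writes)
print("memory accesses, streamed:", streamed.reads + streamed.writes)
```

Both versions produce the same answers, but in this toy setup the streamed version touches memory twice per value rather than twice per value per step.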

Advances in microprocessor-manufacturing techniques are also making it possible to put processors and large amounts of memory on the same chip, shortening the distance that instructions and data must travel. IBM plans to try such processor-in-memory techniques in supercomputing, and DARPA is funding an effort at the University of Southern California to explore exactly how processor-in-memory technology can improve high-performance systems. Researchers at the University of Southern California are working with Hewlett-Packard to deliver an experimental system to DARPA for evaluation. And a project at Caltech known as Gilgamesh is examining the best way to arrange memory and logic together on a chip: if small blocks of memory and logic were interspersed around a chip, for instance, travel time would continue to drop, enhancing performance.

Another option is simply to turn the entire architecture on its head. Cray’s Burton Smith and Caltech’s Sterling are cooperating on a DARPA-funded project they call Cascade. The two are investigating ways to exploit the fact that in high-end scientific computing, the heft of the data alone is often far greater than the heft of the application program. In other words, if the stuff stored in memory is so much larger than the program needed to run it, why move it at all? Why not move the program to the memory instead? They are willing to share their optimism, but Smith and Sterling say little else about their nascent architecture. “I’m purely subjective, but I think it’s the most exciting thing in high-performance computing in the last 20 years,” says Sterling.
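
The question Smith and Sterling pose comes down to how many bytes have to travel. The sketch below shows only that arithmetic; it says nothing about how Cascade itself works, which its designers have not disclosed. The data size, program size, node count, and result size are all hypothetical.

```python
def traffic_move_data(data_bytes: int) -> int:
    """Bytes moved when all of the data travels to the processor."""
    return data_bytes


def traffic_move_program(program_bytes: int, memory_nodes: int,
                         result_bytes_per_node: int) -> int:
    """Bytes moved when a copy of the program travels to each block of memory
    and only a small result comes back from each."""
    return memory_nodes * (program_bytes + result_bytes_per_node)


# Hypothetical sizes: 1 terabyte of data spread over 1,000 memory nodes,
# a 10-megabyte program, and an 8-byte result returned by each node.
DATA_BYTES = 10**12
PROGRAM_BYTES = 10**7
NODES = 1000
RESULT_BYTES = 8

to_processor = traffic_move_data(DATA_BYTES)
to_memory = traffic_move_program(PROGRAM_BYTES, NODES, RESULT_BYTES)

print(f"move data to the program: {to_processor:,} bytes")
print(f"move the program to the data: {to_memory:,} bytes")
print(f"reduction: about {to_processor / to_memory:.0f}x")
```

With these made-up sizes, shipping the program moves roughly a hundredth as many bytes; the advantage shrinks as programs grow or as the data set gets smaller.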

These are the proposals eliciting the most interest at the government agencies likely to be buying the high-performance systems of the future. But for the most part they are prototypes, or even less, and they will prove themselves only after radical changes are made in everything from processor design to the way software is engineered. Then the challenge will remain to find a way to manufacture the new systems efficiently and without defects. In other words, these ideas offer no quick fix.

Current and Proposed Supercomputer Architectures
Architectural Approach	Description	Advantages	Main Proponents
Commodity clusters (operational)	Hundreds or thousands of off-the-shelf servers with low-bandwidth links	Low-cost construction; efficient with problems that can be broken into chunks	Hewlett-Packard, IBM, Silicon Graphics
Vector computing (operational)	Hundreds of custom-built processors with high-bandwidth connectors	More time spent computing, less time communicating	Cray, NEC
Streaming (experimental)	Intermediary values of calculations stored in local memory	Speed; on-chip data transfer for reducing the “memory bottleneck”	Stanford University
Processor-in-memory (experimental)	Processing circuits and short-term memory interspersed on the same chip	Speed; shorter distance between processors and memory	University of Southern California, Caltech, IBM
Cascade (experimental)	Data, rather than software, held in processor’s local memory	Fewer calls to memory in cases where data sets are larger than programs	Cray, Caltech

Computing’s Apollo Project?

For the last decade, the U.S. high-performance-computing community has been standing on the shoulders of giants. Many directors of centers for scientific computing say they believe the United States is at a critical decision point, where choice of projects and the amount of funding invested in new high-performance-computing architectures could affect future security and prosperity in tangible ways.

“It’s really going to take a combination of good ideas coming out of universities and government funding and good industrial engineering to address this nasty problem,” says Bell. “Building a new chip is right at the hairy edge of what a university can accomplish; then you need someone with the resources to do detailed engineering stuff like cooling and connections and so on. It’s going to take a lot of effort.”

But if it’s done right, an entirely new golden age of science could flower. One of the most striking aspects of the Earth Simulator project is its openness. Scientists are communicating despite language and geographical barriers. They are testing theories and conducting simulations that have the potential to improve our understanding of the world and benefit all of us. A few months ago, Sterling brokered a meeting between Tetsuya Sato, director of the Earth Simulator facility, and John Gyakum, a McGill University professor who is one of the world’s leading experts on the ways small weather systems such as thunderstorms affect global weather patterns. Before the Earth Simulator there had been no computer that could easily factor such small systems into large-scale climate simulations. Now there may be. “They have opened themselves to collaboration because they care above all about scientific results,” says Sterling. “And what they’re doing is important to everyone on the planet.”

So it’s not just to advance computer science that more and smarter computing is required. It’s to advance every science. “Science in the 21st century rests on three pillars,” says the Energy Department’s Decker. “As always, there’s theory and experiments. But simulation is going to be the third pillar for scientific discovery. Given the problems we’re faced with, we clearly want to be at the cutting edge with our science. If the performance of our computers is an order of magnitude less than what we know they can be even today, then we won’t be.”
