Cooler Supercomputers

“Superclusters” of supercomputers are getting too hot. SGI’s Eng Lim Goh wants to solve that problem.

Wade Roush

February 14, 2006

Thanks to advances in the speed of supercomputer simulations, complex phenomena such as weather systems, protein folding, and nuclear explosions are becoming easier to model and understand. But only a small part of this speedup is due to faster processors. Instead, the most common way to reach supercomputing capacities is to assemble hundreds or thousands of separate machines in clusters. When yoked together, such a cluster shares a single memory and can perform massive simulations in parallel by breaking up the work into many small parts.

Even this approach, however, has its limits. For one thing, the larger the memory, the more likely some parts of it will fail during a computation. Also, the more machines that are assembled into a cluster, the more heat they produce. Indeed, in some large computing centers, ventilation and air conditioning systems, fans, and liquid cooling systems are hard-pressed to keep machines from overheating.

Mountain View, CA-based Silicon Graphics, also known as SGI, builds some of the world’s largest supercomputing clusters. The fourth-fastest one in the world, for instance, is Columbia, a system that SGI built for NASA Ames Research Center in 2004. Columbia includes 20 SGI Altix “superclusters,” each with 512 processors, for a total of 10,240 processors that share a 20-terabyte memory. Cooling this behemoth (which NASA uses to model problems involving large amounts of data, such as climate change, magnetic storms, and designs for hypersonic aircraft) is currently a very low-tech affair: it’s accomplished mainly by blowing air past the processors at high speed.

Eng Lim Goh, a computer scientist and chief technical officer at SGI, says one NASA administrator told him, “‘I spent millions of dollars on your supercomputer just so we could run simulations that replace our wind tunnel – and you gave us a new wind tunnel.’”

Goh is now the leader of Project Ultraviolet, SGI’s effort to develop its next generation of superclusters. The chips that SGI is designing for Ultraviolet will run applications faster – yet use less electricity and produce less heat. Technology Review interviewed Goh about the project on February 2.

Technology Review: What are the goals of Project Ultraviolet?

Eng Lim Goh: Ultraviolet is where I’ve been spending 80 percent of my time for the last three years, with the goal of having a system shipped by the end of this decade. We’re building ASICs [application-specific integrated circuits] to accelerate certain memory functions and to make applications run cooler; and those have a long development cycle, typically two and a half years.

We started with what we have – basically, a system capable of huge memory. We build huge systems, managing up to 512 processors sharing up to tens of terabytes of memory. The advantage of such systems is that you can load huge databases in memory without the ten times slowdown when you have to get data from a disk. You want the ability to hold all the data in memory and zip around at a high speed, which is important for advanced business analytics and intelligence.

Our ASICs fit below the [Intel] Itanium processor, with memory below each of these, and they talk to each other to give a virtual, single view of all the memory to the user and the operating system. We make sure to use this low-cost, off-the-shelf memory. However, along with this came off-the-shelf reliability. So, in Ultraviolet we are putting in features to make memory more reliable. For example, there are intelligent agents in our chipset that can go out and scrub unused memory, to force parts that were about to fail to do so during the scrubbing process, not the application process. The agents quickly deallocate that memory, very much like a bad disk.
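The scrubbing idea Goh describes can be illustrated with a toy model. The sketch below is not SGI's design: the page states, the `scrub` function, and the `is_healthy` check are all hypothetical names chosen for illustration. It assumes only what the interview says, namely that unused memory is exercised in the background and that anything that fails is taken out of service before an application touches it.

```python
def scrub(pages, is_healthy):
    """Test every free page and retire the ones that fail.

    pages: dict mapping page id -> "free", "in_use", or "retired"
           (hypothetical states for this sketch).
    is_healthy: callable that stands in for a hardware read/write test.
    Returns the list of page ids retired during this pass.
    """
    retired = []
    for page_id, state in pages.items():
        if state != "free":
            continue  # never disturb memory an application is using
        if not is_healthy(page_id):
            pages[page_id] = "retired"  # deallocate, like a bad disk block
            retired.append(page_id)
    return retired


# Assumed example: eight pages, two in use, two about to fail.
weak_pages = {3, 5}
pages = {i: ("in_use" if i < 2 else "free") for i in range(8)}
retired = scrub(pages, lambda page_id: page_id not in weak_pages)
print(retired)
```

The point of the model is the ordering: the failure surfaces during the scrub pass, so no application is ever handed page 3 or page 5.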

TR: Some of what you’re doing, when you talk about agents, sounds like what IBM calls “autonomic computing.”

ELG: People use different names for making computing more self-healing. We were thinking of whether we should use “autonomic memory,” or “self-healing systems,” like IBM and other vendors. But we got a bit concerned because that sets a really high expectation.

TR: What about the heat problem? I assume the systems people will build using your next-generation systems will have even more than 512 processors, all in one room, putting out huge amounts of heat.

ELG: We can force the heat out of the racks with faster fans, but then the computer room becomes very difficult to cool. The only way to deal with more heat is to move the heat faster. There are computer rooms where if you open up a floorboard, they tell you ‘Don’t put your foot in there,’ because the air down there is moving at 100 kilometers per hour.

So the other part of what we want to explore with Ultraviolet is how to reduce this heat and how to deal with applications that are not scaling [that is, do not run as fast as expected when running on more processors in parallel]. These two are related. Let’s say an application runs for 100 seconds on a single processor. And let’s say that on 100 processors it runs ten times faster – it runs for 10 seconds. That is a great improvement – but you’re using 100 times as many processors to get there. As such, you’re only 10 percent efficient; the application is using ten times more energy and putting out ten times more heat than it needs to.
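Goh's arithmetic can be checked in a few lines. This is a minimal sketch using only the numbers from his example; the function name `parallel_efficiency` is ours, and treating energy as proportional to processor-seconds is a simplifying assumption.

```python
def parallel_efficiency(serial_seconds, parallel_seconds, processors):
    """Return (speedup, efficiency) for a parallel run."""
    speedup = serial_seconds / parallel_seconds
    efficiency = speedup / processors
    return speedup, efficiency


serial_seconds = 100.0    # one processor
parallel_seconds = 10.0   # 100 processors
processors = 100

speedup, efficiency = parallel_efficiency(
    serial_seconds, parallel_seconds, processors
)

# Assumption: energy scales with total processor-seconds consumed.
processor_seconds = parallel_seconds * processors
energy_ratio = processor_seconds / serial_seconds

print(f"speedup: {speedup:.0f}x")
print(f"efficiency: {efficiency:.0%}")
print(f"processor-seconds vs. serial run: {energy_ratio:.0f}x")
```

The run finishes ten times sooner but consumes 1,000 processor-seconds instead of 100, which is where the "ten times more energy" figure comes from.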

TR: So how do you make applications run cooler?

ELG: One big part is the way we break problems down into pieces and the way we allocate those to the processors. We did an analysis of about 50 customer applications to see what was going wrong with these applications when they’re running in parallel. We identified four or five major areas.

One is communications latency [delays]. The problem is that most applications require constant synchronization, to make sure every process is ready before the next step of the computation. This synchronization uses a lot of time. It’s like having six people trying to stand in a straight line – they have to check with each other. With 60, or 600, or 6,000 people, it takes exponentially longer to get in a straight line.

Second after latency is the communications bandwidth issue. Sometimes you want to transfer a lot of data and the thickness of the connection between the processors will then decide how long it takes for that huge piece of data to get through. If you’re waiting, you’re not computing. That’s another area where efficiency drops.

The third area is load imbalance, which is a huge problem. Say you want to model the weather in your area. You assume that the volume of air in your area is a huge cube, and you divide that into eight sub-cubes and you distribute those sub-cubes to different processors. On a day when the weather is homogeneous across the big cube, the load on those processors may be balanced; but if there is local turbulence in one of the sub-cubes, there will be processors that sit waiting while other processors finish.
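The weather example can be made concrete with a small sketch. The work figures are invented for illustration (10 time units per calm sub-cube, 40 for a turbulent one), and the model assumes that every processor must wait for the slowest one before the next time step begins.

```python
def step_stats(work_per_processor):
    """Return (step_time, utilization, idle_times) for one time step.

    Assumes one sub-cube per processor and a synchronization point at
    the end of the step, so the step lasts as long as the slowest job.
    """
    step_time = max(work_per_processor)
    busy = sum(work_per_processor)
    utilization = busy / (step_time * len(work_per_processor))
    idle_times = [step_time - work for work in work_per_processor]
    return step_time, utilization, idle_times


# Hypothetical workloads for the eight sub-cubes.
calm_day = [10.0] * 8
stormy_day = [10.0] * 7 + [40.0]  # local turbulence in one sub-cube

calm_time, calm_util, calm_idle = step_stats(calm_day)
storm_time, storm_util, storm_idle = step_stats(stormy_day)

print(f"calm day:   step {calm_time}, utilization {calm_util:.0%}")
print(f"stormy day: step {storm_time}, utilization {storm_util:.0%}")
```

Under these assumed numbers, one turbulent sub-cube stretches the step from 10 to 40 time units, and seven of the eight processors spend three-quarters of the step waiting.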

The fourth area is when an application needs a piece of data and the data is not in the processor’s own cache, and it has to go out to memory. When it goes out to memory there is a huge latency impact.

So these would be the tenets of Ultraviolet design [more reliable memory, less communications latency, more communications bandwidth, better load balancing, and less memory latency]. Say you have an application that is topping out at 128 processors, because it is bottlenecking on communications latency. This chip we’re designing is going to drastically reduce latency, which will now allow this application to run on more processors. Or, if you’re still running that same application on 128 processors, you should perform better and create less heat.

Caption for home page image: A view from the top: Bridges connect nodes of the 20-node SGI Altix supercomputer housed at the NASA Advanced Supercomputing facility.

Home page image courtesy of NASA Ames Research Center/Tom Trower
