Parallel Universe

In an effort to move forward, Intel dusts off old supercomputing technology.

Robert X. Cringleyarchive page

December 22, 2008

When Anwar Ghuloum came to work at Intel in 2002, the company was supreme among chip makers, mainly because it was delivering processors that ran at higher and higher speeds. “We were already at three gigahertz with Pentium 4, and the road map called for future clock speeds of 10 gigahertz and beyond,” recalls Ghuloum, who has a PhD from Carnegie Mellon and is now one of the company’s principal engineers. In that same year, at Intel’s developer conference, chief technology officer Pat Gelsinger said, “We’re on track, by 2010, for 30-gigahertz devices, 10 nanometers or less, delivering a tera-instruction of performance.” That’s one trillion computer instructions per second.

But Gelsinger was wrong. Intel and its competitors are still making processors that top out at less than four gigahertz, and something around five gigahertz has come to be seen, at least for now, as the maximum feasible speed for silicon technology.

It’s not as if Moore’s Law–the idea that the number of transistors on a chip doubles every two years–has been repealed. Rather, unexpected problems with heat generation and power consumption have put a practical limit on processors’ clock speeds, or the rate at which they can execute instructions. New technologies, such as spintronics (which uses the spin direction of a single electron to encode data) and quantum (or tunneling) transistors, may ultimately allow computers to run many times faster than they do now, while using much less power. But those technologies are at least a decade away from reaching the market, and they would require the replacement of semiconductor manufacturing lines that have cost many tens of billions of dollars to build.

So in order to make the most of the technologies at hand, chip makers are taking a different approach. The additional transistors predicted by Moore’s Law are being used not to make individual processors run faster but to increase the number of processors inside a chip. Chips with two processors–or “cores”–are now the desktop standard, and four-core chips are increasingly common. In the long term, Intel envisions hundreds of cores per device.

But here’s the thing: while the hardware problem of overheating chips lends itself nicely to the hardware solution of multicore computing, that solution gives rise in turn to a tricky software problem. How do you program for multiple processors? It’s Anwar Ghuloum’s job to figure that out, with the help of programming groups he manages in the United States and China.

Microprocessor companies take a huge risk in adopting the multicore strategy. If they can’t find easy ways to write software for the new chips, they could lose the support of software developers. This is why Sony’s multicore PlayStation 3 game machine was late to market and still has fewer game titles than its competitors.

The Problem with Silicon
For the first 30 years of microprocessor development, the way to increase performance was to make chips that had smaller and smaller features and ran at higher and higher clock speeds. The original Apple II computer of 1977 used an eight-bit processor that ran at one megahertz. The PC standard today is a 64-bit chip running at 3.6 gigahertz–effectively, 28,800 times as fast. But that’s where this trajectory seems to end. By around 2002, the smallest features that could be etched on a chip using photolithography had shrunk to 90 nanometers–a scale at which unforeseen effects caused much of the electricity pumped into each chip to simply leak out, making heat but doing no work at all. Meanwhile, transistors were crammed so tightly on chips that the heat they generated couldn’t be absorbed and carried away. By the time clock speeds reached five gigahertz, the chip makers realized, chips would get so hot that without elaborate cooling systems, the silicon from which they were made would melt. The industry needed a different way to improve performance.

Because of the complex designs that high-speed single-core chips now require, multiple cores can deliver the same amount of processing power while consuming less electricity. Less electricity generates less heat. What’s more, the use of multiple cores spreads out whatever heat there is.

Most computer programs, however, weren’t designed with multiple cores in mind. Their instructions are executed in a linear sequence, with nothing happening in parallel. If your computer seems to be doing more than one thing at a time, that’s because the processor switches between activities more quickly than you can comprehend. The easiest way to use multiple cores has thus been through a division of labor–for example, running the operating system on one core and an application on another. That doesn’t require a whole new programming model, and it may work for today’s chips, which have two or four cores. But what about tomorrow’s, which may have 64 cores or more?

Revisiting Old Work
Fortunately, says Leslie Valiant, a professor of computer science and applied mathematics at Harvard University, the fundamentals of parallelism were worked out decades ago in the field of high-performance computing–which is to say, with supercomputers. “The challenge now,” says Valiant, “is to find a way to make that old work useful.”

The supercomputers that inspired multicore computing were second-generation devices of the 1980s, made by companies like Thinking Machines and Kendall Square Research. Those computers used off-the-shelf processors by the hundreds or even thousands, running them in parallel. Some were commissioned by the U.S. Defense Advanced Research Projects Agency as a cheaper alternative to Cray supercomputers. The lessons learned in programming these computers are a guide to making multicore programming work today. So Grand Theft Auto might soon benefit from software research done two decades ago to aid the design of hydrogen bombs.

In the 1980s, it became clear that the key problem of parallel computing is this: it’s hard to tear software apart, so that it can be processed in parallel by hundreds of processors, and then put it back together in the proper sequence without allowing the intended result to be corrupted or lost. Computer scientists discovered that while some problems could easily be parallelized, others could not. Even when problems could be parallelized, the results might still be returned out of order, in what was called a “race condition.” Imagine two operations running in parallel, one of which needs to finish before the other for the overall result to be correct. How do you ensure that the right one wins the race? Now imagine two thousand or two million such processes.

“What we learned from this earlier work in high-performance computing is that there are problems that lend themselves to parallelism, but that parallel applications are not easy to write,” says Marc Snir, codirector of the Universal Parallel Computing Research Center (UPCRC) at the University of Illinois at Urbana-Champaign. Normally, programmers use specialized programming languages and tools to write instructions for the computer in terms that are easier for humans to understand than the 1s and 0s of binary code. But those languages were designed to represent linear sequences of operations; it’s hard to organize thousands of parallel processes through a linear series of commands. To create parallel programs from scratch, what’s needed are languages that allow programmers to write code without thinking about how to make it parallel–to program as usual while the software figures out how to distribute the instructions effectively across processors. “There aren’t good tools yet to hide the parallelism or to make it obvious [how to achieve it],” Snir says.

**Bright lights:** In 1987, Thinking Machines released its CM-2 supercomputer (above), in which 64,000 processors ran in parallel. The company declared bankruptcy in 1994, but its impact on computing was significant.

To help solve such problems, companies have called back to service some graybeards of 1980s supercomputing. David Kuck, for example, is a University of Illinois professor emeritus well known as a developer of tools for parallel programming. Now he works on multicore programming for Intel. So does an entire team hired from the former Digital Equipment Corporation; in a previous professional life, it developed Digital’s implementation of the message passing interface (MPI), the dominant software standard for multimachine supercomputing today.

In one sense, these old players have it easier than they did the last time around. That’s because many of today’s multicore applications are very different from those imagined by the legendary mainframe designer Gene Amdahl, who theorized that the gain in speed achievable by using multiple processors was limited by the degree to which a given program could be parallelized.

Computers are handling larger volumes of data than ever before, but their processing tasks are so ideally suited to parallelization that the constraints of Amdahl’s Law–described in 1967–are beginning to feel like no constraints at all. The simplest example of a massively parallel task is the brute-force determination of an unknown password by trying all possible character combinations. Dividing the potential solutions among 1,000 processors can’t help but be 1,000 times faster. The same goes for today’s processor-intensive applications for encoding video and audio data. Compressing movie frames in parallel is almost perfectly efficient. But if parallel processing is easier to find uses for today, it’s not necessarily much easier to do. Making it easier will require a concerted effort from chip makers, software developers, and academic computer scientists. Indeed, Illinois’s UPCRC is funded by Microsoft and Intel–the two companies that have the most to gain if multicore computing succeeds, and the most to lose if it fails.

Inventing New Tools
If software keeps getting more complex, it’s not just because more features are being added to it; it’s also because the code is built on more and more layers of abstraction that hide the complexity of what programmers are really doing. This is not mere bloat: programmers need abstractions in order to make basic binary code do the ever more advanced work we want it to do. When it comes to writing for parallel processors, though, programmers are using tools so rudimentary that James Larus, director of software architecture for the Data Center Futures project at Microsoft Research, likens them to the lowest-level and most difficult language a programmer can use.

“We couldn’t imagine writing today’s software in assembly language,” he says. “But for some reason we think we can write parallel software of equal sophistication with the new and critical pieces written in what amounts to parallel assembly language. We can’t.”

That’s why Microsoft is releasing parallel-programming tools as fast as it can. F#, for example, is Microsoft’s parallel version of the general-purpose ML programming language. Not only does it parallelize certain functions, but it prevents them from interacting improperly, so parallel software becomes easier to write.

Intel, meanwhile, is sending Ghuloum abroad one week per month to talk with software developers about multicore architecture and parallel-programming models. “We’ve taken the philosophy that the parallel-programming ‘problem’ won’t be solved in the next year or two and will require many incremental improvements–and a small number of leaps–to existing languages,” Ghuloum says. “I also tend to think we can’t do this in a vacuum; that is, without significant programmer feedback, we will undoubtedly end up with the wrong thing in some way.”

In both the commercial and the open-source markets, other new languages and tools either tap the power of multicore processing or mask its complexity. Among these are Google’s MapReduce framework, which makes it easier to run parallel computations over clusters of computers, and Hadoop, an open-source implementation of MapReduce that can distribute applications across thousands of nodes. New programming languages like Clojure and Erlang were designed from the ground up for parallel computing. The popular Facebook chat application was written partly in Erlang.

Meanwhile, MIT spinoff Cilk Arts can break programs written in the established language C++ into “threads” that can be executed in parallel on multiple cores. And St. Louis-based Appistry claims that its Enterprise Application Fabric automatically distributes applications for Microsoft’s .Net programming framework across thousands of servers without requiring programmers to change a single line of their original code.

The Limits of Multicore Computing

Just as Intel’s dream of 10- and 30-gigahertz chips gave way to the pursuit of multicore computing, however, multicore itself might be around for a matter of years rather than decades. The efficiency of parallel systems declines with each added processor, as cores vie for the same data; there will come a point at which adding an additional core to a chip will actually slow it down. That may well set a practical limit on the multicore strategy long before we start buying hundred-core PCs.

Does it matter, though? While there may be applications that demand the power of many cores, most people aren’t using those applications. Other than hard-core gamers, few people are complaining that their PCs are too slow. In fact, Microsoft has emphasized that Windows 7, the successor to the troubled Windows Vista, will use less processing power and memory than Vista–a move made necessary by the popularity of lower-power mobile computing platforms and the expected migration of PC applications to Internet-based servers. A cynic might say that the quest for ever-increasing processing power is strictly commercial–that semiconductor and computer companies, software vendors, and makers of mobile phones need us to buy new gizmos.

So what’s the downside if multicore computing fails? What is the likely impact on our culture if we take a technical zig that should have been a zag and suddenly aren’t capable of using all 64 processor cores in our future notebook computers?

“I can’t wait!” says Steve Wozniak, the inventor of the Apple II. “The repeal of Moore’s Law would create a renaissance for software development,” he claims. “Only then will we finally be able to create software that will run on a stable and enduring platform.”

“In schools,” says Woz, “the life span of a desk is 25 years, a textbook is 10 years, and a computer is three years, tops. Which of these devices costs the most to buy and operate? Why, the PC, of course. Which has residual value when its useful life is over? Not the PC–it costs money to dispose of. At least books can be burned for heat. Until technology slows down enough for computing platforms to last long enough to be economically viable, they won’t be truly intrinsic to education. So the end of Moore’s Law, while it may look bad, would actually be very good.”

Robert X. Cringely has written about technology for 30 years. He is the author of Accidental Empires: How the Boys of Silicon Valley Make Their Millions, Battle Foreign Competition, and Still Can’t Get a Date.

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.