A Tool to Make More of Many Cores

Intel’s Ct software will make ordinary code work on forthcoming many-core processors.

Robert X. Cringley

April 21, 2009

When it comes to writing computer software, there is a huge difference between "multi" and "many." Mere humans can write good software for the current breed of multicore processors, which have two to four processing units on a single chip, although this still requires extra skill and patience. The next step is many-core processors with anywhere from sixteen to hundreds of cores. That is too many for any programmer to command efficiently by hand.

**Hard cores**: A wafer containing Intel’s Teraflops Research Chip with 80 cores.

That's why, later this year, Intel will release a research project from its lab called Ct ("C for Throughput"). Ct is meant to automatically make standard C and C++ compilers work with many-core processors, starting with Intel's first new graphics processor in many years. That chip, code-named Larrabee, is scheduled to ship in early 2010.

Ct, which will be part of Intel’s Parallel Studio software-development tool kit and may have a different name by its ship date, is all about the company’s new orientation toward energy efficiency. “We’re investing the power budget into features people want to use,” explains an Intel engineer who is not authorized to speak for the company, “so we’ll have these ‘eight-wide’ and ‘sixteen-wide’ chips, but without a tool like Ct we’d be leaving three-fourths or seven-eighths or fifteen-sixteenths of that performance potential on the floor, unused. Nobody wants that.”

Embracing parallelism in software has traditionally required programmers first to figure out which parts of their code could most easily be adapted to parallel processing and then to isolate those parts in a module. But whether the language was C# or Java, the job of isolating and applying parallel code had to begin again for each new processor family, or whenever the number of cores increased substantially. According to Intel, Ct makes all of that automatic, and optimizing for the many-core processors of the future won't even require a recompile.

Unlike competing programming architectures such as nVidia's CUDA, which enables massive parallelism using large numbers of that company's graphics processors, Ct is backward compatible with the entire body of software written for Intel's long-running x86 architecture. So presumably, if you wanted to run your circa-1982 copy of Lotus 1-2-3 efficiently across an eight-core (or larger) processor, Ct could make that happen. According to the company, Ct is x86-specific, not Intel-specific, so the code will work equally well on processors from Intel's archcompetitor, AMD.

But most important for Intel, Ct will work with Larrabee, the company's first dedicated graphics chip since the i740 was released in the late 1990s. Larrabee is also Intel's first processor that absolutely needs a tool like Ct to appeal to 3-D game programmers, who are its initial target customers.

For Intel, Larrabee is a chance to enter a whole new market, competing directly with nVidia and with AMD's ATI graphics division. Larrabee, it turns out, is a fusion of dedicated graphics hardware and x86 technology. "If a software tool exists, it exists on x86," says the Intel engineer. "We'll pull the whole x86 ecosystem into the graphics space."

Larrabee will not be a separate graphics chip in the same sense that an nVidia or ATI GPU is. Yet if Larrabee and Ct work as predicted, the days of discrete graphics processors may soon be over.

“Ct is a good match for Larrabee,” says Marc Snir, head of the High Performance Computing Laboratory at the University of Illinois. “We have thought of Ct as something that is much more attractive than CUDA or OpenCL for developing data-parallel code.”

Snir adds that Ct could become a versatile language for “general-purpose GPU code and the use of GPUs as accelerators for scientific and high-performance computing.”

Intel hasn't yet announced how many cores Larrabee will have when it ships early next year, but a good guess would be 16. That's 16 cores with 4 execution threads each, for a total of 64 threads. If Moore's Law doubles those numbers every 18 months thereafter, three years later (two doublings) there would be 256 execution threads on one chip. The big challenge will be making that work with software written for older Intel chips, and running all of the traditional application programming interfaces, such as OpenGL and DirectX, efficiently and transparently across 256 threads and more.

Thanks to Ct, programmers apparently won’t even have to know it is happening.
