Speedier Bug Catching

Specialized transistors track hardware bugs as they happen.

Kate Greenearchive page

March 31, 2010

As microprocessors get smaller and more intricate, finding the hardware bugs that can cause a computer to crash requires more time, money, and engineering effort. But now engineers at Stanford University have proposed a shortcut that could help locate bugs in a fraction of the time.

Debugging normally involves putting a chip through a battery of tests to identify spots that are likely to fail and to give engineers a chance to fix problems before the chips go into mass production. As chip-making companies as push the functionality of their hardware, this becomes increasingly complicated.

Subhasish Mitra, professor of electrical engineering and computer science at Stanford and colleagues have developed a method that uses a small number (about 1 percent) of the transistors on a chip to record a log of chip activity–the instructions that pass through the chip’s circuits. This log can be extracted from the chip, dumped into a computer, and analyzed to find out where the bugs are.

“It’s enormously expensive to diagnose where chips are failing,” says Rob Rutenbar, professor of computer science at the University of Illinois, who wasn’t involved with the research. As the features on microprocessors get smaller, Rutenbar says, “people worry more about wear-out and reliability issues.”

Engineers test for bugs throughout the making of a chip. First, they scour the designs to find any so-called functional or logic errors. Then, after the designs have been etched into silicon, engineers look for bugs that crop up under operating conditions such as playing video or browsing the Web. This process is called post-silicon debugging, and 30 to 40 percent of the time and money spent on making a new chip by companies like Intel and AMD is allotted to post-silicon debugging, says Mitra.

During the post-silicon phase, engineers pulse electrical signals through the chip, mimicking the electrical activity seen during normal operation. If a chip fails during these tests, engineers try to re-create the electrical signals that caused the problem. Next, they try to pinpoint the exact set of instructions and conditions responsible for the failure. But simulation takes time: a single second in silicon can be equivalent to days of simulation, says Mitra. Moreover, many of the errors occur due to operating temperatures and workloads that are difficult to re-create. “The trouble is, the whole electrical state of the system changes,” Mitra says.

So Mitra and Stanford graduate student Sung-Boem Park decided to catch evidence of the bugs while they happen, eliminating much of the time spent doing electrical simulations. The challenge was finding the right way to record information about chip instructions without using too many transistors and without storing too much information. To do this, they built recording devices, or buffers, into chips. This is not a new idea. In fact, almost any kind of commercially available chip today has a small number of transistors whose job is to hold small amounts of data about chip activity–to ensure, for instance, that operations are synchronized across the chip.

The Stanford researchers’ approach, called instruction footprint recording and analysis (IFRA), was designed to collect just the right amount of information about the chip’s activity at just the right time. As trillions of instructions stream through a chip, information describing those instructions pass through so-called “circular buffers,” containers that hold information for a short time before being refreshed. When a failure or hint of an impending failure is detected, the system stops recording in the circular buffers and saves the buggy instructions.

When a chip fails, data that represents the chip’s activity has been transferred to a computer. Software developed by the researchers decodes the labels, laying out the instructions–and the corresponding location on the chip–that led to the failure. Engineers, once they know the location of the bug, can make small changes, such as changing the timing of instructions, to keep the error from recurring.

Mitra has collaborated with Intel and tested IFRA on its Core i7 chips, which can have two to six cores, and found that IFRA can generally locate 96 percent of bugs and pinpoint 80 percent with their exact time and location. A description of the technique appeared in the February issue of Communications of the ACM.

In a perspective piece written about the work, Shekhar Borkar, the director of the microprocessor technology lab at Intel, writes that he “would not be surprised if [the researchers’ approach] catches on quickly.”

Borkar adds that it has “great potential to go further and help debug multicores, memory systems, analog circuits, and even complex SOCs [systems on chips].”

However, it’s still early days for the research, says Rutenbar. The main concern is balancing the amount of hardware used to track bugs with the number of problems that can be found. Still, he says, “I think the IFRA stuff is very good work.”

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.