As microprocessors get smaller and more intricate, finding the hardware bugs that can cause a computer to crash requires more time, money, and engineering effort. But now engineers at Stanford University have proposed a shortcut that could help locate bugs in a fraction of the time.
Debugging normally involves putting a chip through a battery of tests to identify spots that are likely to fail and to give engineers a chance to fix problems before the chips go into mass production. As chip-making companies push the functionality of their hardware, this becomes increasingly complicated.
Subhasish Mitra, professor of electrical engineering and computer science at Stanford, and colleagues have developed a method that uses a small number (about 1 percent) of the transistors on a chip to record a log of chip activity: the instructions that pass through the chip’s circuits. This log can be extracted from the chip, dumped into a computer, and analyzed to find out where the bugs are.
“It’s enormously expensive to diagnose where chips are failing,” says Rob Rutenbar, professor of computer science at the University of Illinois, who wasn’t involved with the research. As the features on microprocessors get smaller, Rutenbar says, “people worry more about wear-out and reliability issues.”
Engineers test for bugs throughout the making of a chip. First, they scour the designs to find any so-called functional or logic errors. Then, after the designs have been etched into silicon, engineers look for bugs that crop up under operating conditions such as playing video or browsing the Web. This process is called post-silicon debugging, and it accounts for 30 to 40 percent of the time and money that companies like Intel and AMD spend on making a new chip, says Mitra.
During the post-silicon phase, engineers pulse electrical signals through the chip, mimicking the electrical activity seen during normal operation. If a chip fails during these tests, engineers try to re-create, in simulation, the electrical signals that caused the problem. Next, they try to pinpoint the exact set of instructions and conditions responsible for the failure. But simulation takes time: a single second in silicon can be equivalent to days of simulation, says Mitra. Moreover, many of the errors occur due to operating temperatures and workloads that are difficult to re-create. “The trouble is, the whole electrical state of the system changes,” Mitra says.
So Mitra and Stanford graduate student Sung-Boem Park decided to catch evidence of the bugs while they happen, eliminating much of the time spent doing electrical simulations. The challenge was finding the right way to record information about chip instructions without using too many transistors and without storing too much information. To do this, they built recording devices, or buffers, into chips. This is not a new idea. In fact, almost any kind of commercially available chip today has a small number of transistors whose job is to hold small amounts of data about chip activity–to ensure, for instance, that operations are synchronized across the chip.
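To give a rough sense of how a small recording buffer can keep a useful log without storing too much, here is a minimal software sketch in Python. It is only an illustration of the general idea of a fixed-size buffer that retains the most recent activity. It is not the Stanford team's design, and the class name, the capacity, and the instruction names are all hypothetical.

```python
from collections import deque


class TraceBuffer:
    """Hypothetical fixed-capacity log that keeps only the newest entries."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry once full,
        # so storage never grows beyond `capacity`.
        self._entries = deque(maxlen=capacity)

    def record(self, cycle, instruction):
        # Store a (cycle, instruction) pair describing one step of activity.
        self._entries.append((cycle, instruction))

    def dump(self):
        # Return the retained entries, oldest first, for offline analysis.
        return list(self._entries)


# Made-up instruction stream, used purely for illustration.
program = ["load", "add", "store", "load", "mul", "branch"]

buf = TraceBuffer(capacity=4)
for cycle, instruction in enumerate(program):
    buf.record(cycle, instruction)

trace = buf.dump()
print(trace)
```

Because the buffer holds only four entries in this sketch, the two oldest instructions are dropped and only the last four remain when the log is dumped. A real on-chip buffer faces the same trade-off between how much history it keeps and how many transistors it consumes.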