The Stanford researchers’ approach, called instruction footprint recording and analysis (IFRA), was designed to collect just the right amount of information about the chip’s activity at just the right time. As trillions of instructions stream through a chip, information describing those instructions passes through so-called “circular buffers,” containers that hold information for a short time before being overwritten. When a failure or hint of an impending failure is detected, the system stops recording in the circular buffers, preserving the record of the instructions that led up to the bug.
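The circular-buffer idea can be illustrated with a small software sketch. This is only an analogy: IFRA is implemented in hardware on the chip, and the class name, method names, and buffer size below are hypothetical choices made for illustration, not details from the researchers' design. The sketch shows a fixed-size buffer that keeps overwriting its oldest entries until a failure is flagged, at which point it stops recording so the most recent history survives.

```python
from collections import deque


class FootprintRecorder:
    """Toy circular buffer that stops recording once a failure is flagged.

    Hypothetical illustration only; not the IFRA hardware design.
    """

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)
        self.frozen = False

    def record(self, instruction_id, info):
        # After a failure is detected, ignore further instructions so the
        # history leading up to the failure is not overwritten.
        if self.frozen:
            return
        self.buffer.append((instruction_id, info))

    def on_failure_detected(self):
        self.frozen = True

    def dump(self):
        # Return the preserved history, oldest entry first.
        return list(self.buffer)


recorder = FootprintRecorder(capacity=4)
for i in range(10):
    recorder.record(i, f"op-{i}")
    if i == 7:
        # Pretend an error detector fired right after instruction 7.
        recorder.on_failure_detected()

snapshot = recorder.dump()
print(snapshot)
```

In this toy run the buffer ends up holding only instructions 4 through 7: everything older was overwritten, and everything after the failure signal was ignored. That is the trade-off the article describes, with a small amount of storage capturing just the window of activity around the failure.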
When a chip fails, the data representing the chip’s activity is transferred to a computer. Software developed by the researchers decodes the labels, laying out the instructions, and the corresponding locations on the chip, that led to the failure. Once engineers know the location of the bug, they can make small changes, such as adjusting the timing of instructions, to keep the error from recurring.
Mitra has collaborated with Intel and tested IFRA on its Core i7 chips, which can have two to six cores, and found that IFRA can generally locate 96 percent of bugs and pinpoint 80 percent with their exact time and location. A description of the technique appeared in the February issue of Communications of the ACM.
In a perspective piece about the work, Shekhar Borkar, the director of the microprocessor technology lab at Intel, writes that he “would not be surprised if [the researchers’ approach] catches on quickly.”
Borkar adds that it has “great potential to go further and help debug multicores, memory systems, analog circuits, and even complex SOCs [systems on chips].”
However, it’s still early days for the research, says Rutenbar. The main concern is balancing the amount of hardware used to track bugs with the number of problems that can be found. Still, he says, “I think the IFRA stuff is very good work.”