Researchers at the University of Michigan have developed software that hunts for flaws in chips and proposes the best way to fix them. Their approach tackles a growing problem for chip makers such as AMD and Intel. As transistors shrink and chip designs grow more complex, hardware bugs are becoming more prevalent. Currently, it can take up to a year to debug prototype chips and get them ready for mass production. The new software could shorten the time it takes to get a chip to market, cut costs by reducing the number of prototypes and testing cycles, and ultimately yield chips with fewer flaws.
“This is still an unsolved problem,” says Rob Rutenbar, a professor of electrical and computer engineering at Carnegie Mellon University, who adds that there is very little scientific literature on debugging silicon. “Intel might have some sophisticated technology, but they’re not talking about it. For all we know, people are doing it by hand,” Rutenbar says. “The sense that I get is that it’s not very well automated.”
Debugging by hand leaves more room for error. “Pretty much all chips, including microprocessors, are buggy,” says Igor Markov, a professor of electrical engineering and computer science at the University of Michigan. Intel’s website, for instance, lists about 130 known hardware bugs on commercial laptops. Most can be fixed with software downloads, but about 20 of them can’t be, Markov says, and they leave machines vulnerable to viruses.
Markov and his colleague Valeria Bertacco, professor of electrical engineering and computer science at Michigan, developed software that tackles the bug-fixing problem after the first round of prototypes has come back to the chip maker. “When you have a first version of a chip, it’s not ready to give to the consumer,” says Bertacco. Engineers need to try to run operating systems and software on it to see if it works, and this process can take anywhere from a couple of hours to a week, depending on the number of flaws in the chips.
“It’s very hard to figure out what is wrong,” Bertacco says. And once an engineer has identified a bug, which can be anything from wires spaced too closely together to misplaced transistors, it’s not always clear what the best fix will be. Often, engineers repair one problem only to discover in the next round of prototypes that their solutions have inadvertently added other flaws. Prototypes can take months to build, and they are expensive: changing the designs on the masks used to pattern layers of transistors and wires on the chips costs millions of dollars.
Currently, when a prototype comes back to a chip maker, engineers hook it up to electrical probes that send electrical signals through it and record the output, explains Bertacco. Different signals go to different parts of the chip, and by trying out thousands of signals, engineers can usually locate a problem. Then they propose a series of possible solutions. Sometimes they simply need to remove a connection between two wires in one of the upper layers of the chip. This can be done using equipment readily available in the lab, and the chip can quickly be retested. Other times, fixes are needed at lower layers within the chip, where the transistors make up logic gates. These transistors can’t be so easily adjusted and retested.
The Michigan researchers wrote software that automatically specifies the electrical input to chips being tested and analyzes their output to find problem areas. Ideally, engineers would want to know the output of each transistor on a chip. But consumer chips will soon have more than a billion transistors, which will make such precise testing far too time-consuming, explains Bertacco. So the Michigan algorithm tests a number of inputs across a large portion of the chip. Based on the output errors, it knows which part of the chip to concentrate on, “narrowing down a search to a few promising candidate bugs,” says Bertacco. In a similar manner, the software identifies ways to fix the bugs, running through a series of simulations to find a design variation that offers the fastest and most cost-effective solution.
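To make the narrowing-down step concrete, here is a toy sketch in Python. It is not the Michigan software, whose internals the article does not describe. It assumes a hypothetical four-gate circuit and a simple textbook-style check: a gate stays on the suspect list only if overriding its output alone could explain everything observed on the faulty chip. The netlist, the gate names, and the helper functions are all invented for illustration.

```python
from itertools import product

# Hypothetical design: gate name -> (operation, input signals), in evaluation order.
NETLIST = {
    "g1": ("AND", ("a", "b")),
    "g2": ("OR", ("b", "c")),
    "g3": ("XOR", ("g1", "g2")),
    "g4": ("AND", ("g2", "c")),
}
INPUTS = ("a", "b", "c")
OUTPUTS = ("g3", "g4")

OPS = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}


def simulate(vector, netlist=NETLIST, forced=None):
    """Evaluate the circuit for one input vector; `forced` overrides gate outputs."""
    values = dict(zip(INPUTS, vector))
    for gate, (op, ins) in netlist.items():
        if forced and gate in forced:
            values[gate] = forced[gate]
        else:
            values[gate] = OPS[op](*(values[i] for i in ins))
    return tuple(values[o] for o in OUTPUTS)


def candidate_gates(vectors, observed):
    """Keep a gate as a suspect only if, for every test vector, forcing that one
    gate to 0 or 1 reproduces the outputs measured on the chip."""
    suspects = []
    for gate in NETLIST:
        explains_all = all(
            any(simulate(v, forced={gate: bit}) == obs for bit in (0, 1))
            for v, obs in zip(vectors, observed)
        )
        if explains_all:
            suspects.append(gate)
    return suspects


# Stand-in for a flawed prototype (assumption): g2 was built as AND instead of OR.
BUGGY = dict(NETLIST, g2=("AND", ("b", "c")))
vectors = list(product((0, 1), repeat=len(INPUTS)))
observed = [simulate(v, netlist=BUGGY) for v in vectors]

print(candidate_gates(vectors, observed))  # ['g2']
```

On a real chip, exhaustive input vectors are out of the question, which is why the approach described above samples inputs and concentrates on the regions where output errors appear.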
One of the big advantages of the Michigan researchers’ approach, says Rutenbar, is that their software can sometimes come up with counterintuitive solutions. An engineer, he says, might see that the logical way to fix a bug is to rewire a number of circuits. But the software can tell when flipping a few wires will get the same result. “When humans look at it, it’s not at all obvious,” Rutenbar says.
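The repair search can be illustrated the same way. The sketch below is a hypothetical example, not the researchers' method: it enumerates every single-edit change to a tiny buggy circuit, simulates each variant against the intended behavior, and picks the cheapest one that works. The cost figures are assumptions chosen only to reflect the point made earlier that rewiring an upper layer is easier than altering transistors. The spare gate `g2` is also invented.

```python
from itertools import product

OPS = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}
INPUTS = ("a", "b", "c")
OUTPUT = "g3"

# Assumed relative costs: a wire change in an upper layer versus a gate-level change.
REWIRE_COST = 1
REGATE_COST = 10


def spec(a, b, c):
    """Intended behavior of the hypothetical circuit."""
    return (a & b) | c


# Hypothetical buggy prototype: g1 was built as OR instead of AND.
# g2 is an otherwise unused gate that happens to compute a AND b.
BUGGY = {
    "g1": ("OR", ("a", "b")),
    "g2": ("AND", ("a", "b")),
    "g3": ("OR", ("g1", "c")),
}


def simulate(netlist, vector):
    values = dict(zip(INPUTS, vector))
    for gate, (op, ins) in netlist.items():
        values[gate] = OPS[op](*(values[i] for i in ins))
    return values[OUTPUT]


def matches_spec(netlist):
    return all(
        simulate(netlist, v) == spec(*v)
        for v in product((0, 1), repeat=len(INPUTS))
    )


def single_edit_repairs(netlist):
    """Yield (cost, description, variant) for every one-change design variation."""
    gates = list(netlist)
    for idx, gate in enumerate(gates):
        op, ins = netlist[gate]
        # Option 1: swap the gate type (treated as the expensive fix).
        for new_op in OPS:
            if new_op != op:
                yield REGATE_COST, f"{gate}: {op} -> {new_op}", {**netlist, gate: (new_op, ins)}
        # Option 2: reconnect one input to a signal computed earlier (the cheap fix).
        sources = INPUTS + tuple(gates[:idx])
        for pos, src in product(range(2), sources):
            if src != ins[pos]:
                new_ins = ins[:pos] + (src,) + ins[pos + 1:]
                yield REWIRE_COST, f"{gate}: input {pos} {ins[pos]} -> {src}", {**netlist, gate: (op, new_ins)}


def cheapest_repair(netlist):
    working = [
        (cost, desc)
        for cost, desc, variant in single_edit_repairs(netlist)
        if matches_spec(variant)
    ]
    return min(working, default=None)


print(cheapest_repair(BUGGY))  # (1, 'g3: input 0 g1 -> g2')
```

Here the obvious fix is to rebuild the faulty gate `g1`, but the search finds that moving one wire to the spare gate gives the same behavior at a lower assumed cost, which is the kind of non-obvious result Rutenbar describes.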
In case studies, the researchers showed that their software can automatically repair about 70 percent of major silicon bugs, and they claim that they could reduce the amount of time required to find a particular bug from weeks to days.
Intel is keeping an eye on the work, as it is always looking for better ways to improve the chip-making process. Debugging silicon is a “serious problem,” says Shekhar Borkar, an Intel research fellow. He says that Intel uses “the same kind of techniques” that the Michigan researchers do, “but maybe in a different form.” Borkar adds that “there are some advances in the [Michigan] paper.” He says that the Michigan research is a good start to solving the problem but still needs to be proved outside the lab.