How to program unreliable chips

A new language lets coders reason about the trade-off between fidelity of execution and power or time savings in the computers of the future.


Press Contact

Andrew Carleen
Email: expertrequests@mit.edu
Phone: 617-253-1682
MIT News Office

As transistors get smaller, they also become less reliable. So far, computer-chip designers have been able to work around that problem, but in the future, it could mean that computers stop improving at the rate we’ve come to expect.

Another possibility, which some researchers have begun to float, is that we could simply let our computers make more mistakes. If, for instance, a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won’t notice — but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency.

In anticipation of the dawning age of unreliable chips, Martin Rinard’s research group at MIT’s Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as it’s intended.

“If the hardware really is going to stop working, this is a pretty big deal for computer science,” says Rinard, a professor in the Department of Electrical Engineering and Computer Science. “Rather than making it a problem, we’d like to make it an opportunity. What we have here is a … system that lets you reason about the effect of this potential unreliability on your program.”

Last week, two graduate students in Rinard’s group, Michael Carbin and Sasa Misailovic, presented the new system at the Association for Computing Machinery’s Object-Oriented Programming, Systems, Languages and Applications conference, where their paper, co-authored with Rinard, won a best-paper award.

On the dot

The researchers’ system, which they’ve dubbed Rely, begins with a specification of the hardware on which a program is intended to run. That specification includes the expected failure rates of individual low-level instructions, such as the addition, multiplication, or comparison of two values. In its current version, Rely assumes that the hardware also has a failure-free mode of operation — one that might require slower execution or higher power consumption.
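As a rough illustration (the names and format below are hypothetical, not Rely’s actual specification language), such a hardware profile can be thought of as a table of per-instruction reliabilities — the probability that each operation returns a correct result — alongside a failure-free exact mode:

    # Hypothetical hardware reliability profile, for illustration only.
    # Each value is the probability that a single execution of that
    # instruction on the unreliable hardware produces the correct result.
    UNRELIABLE_PROFILE = {
        "add": 0.999999,    # e.g., an assumed failure rate of 1 in a million
        "mul": 0.999998,
        "cmp": 0.999999,
        "load": 0.999995,   # reads from approximate memory
        "store": 0.999995,  # writes to approximate memory
    }

    # The failure-free mode of operation is modeled as reliability 1.0,
    # at the cost of slower execution or higher power consumption.
    EXACT_RELIABILITY = 1.0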

A developer who thinks that a particular program instruction can tolerate a little error simply adds a period — a “dot,” in programmers’ parlance — to the appropriate line of code. So the instruction “total = total + new_value” becomes “total = total +. new_value.” Where Rely encounters that telltale dot, it knows to evaluate the program’s execution using the failure rates in the specification. Otherwise, it assumes that the instruction needs to be executed properly.

Compilers — applications that convert instructions written in high-level programming languages like C or Java into low-level instructions intelligible to computers — typically produce what’s called an “intermediate representation,” a generic low-level program description that can be straightforwardly mapped onto the instruction set specific to any given chip. Rely simply steps through the intermediate representation, folding the probability that each instruction will yield the right answer into an estimation of the overall variability of the program’s output.
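For a straight-line block, that bookkeeping reduces to multiplying probabilities: assuming instruction failures are independent, the chance that the whole block is correct is the product of the per-instruction reliabilities. The sketch below illustrates the idea in Python; it is not Rely’s implementation, and the instruction encoding is made up:

    def straight_line_reliability(instructions, profile):
        """Probability that every instruction in a straight-line block
        yields a correct result. Approximate instructions draw their
        reliability from the hardware profile; exact ones count as 1.0."""
        reliability = 1.0
        for op, approximate in instructions:
            reliability *= profile[op] if approximate else 1.0
        return reliability

    # Illustrative lowering of "total = total +. new_value": an approximate
    # load, an approximate add, and an exact store.
    profile = {"load": 0.999995, "add": 0.999999, "store": 0.999995}
    block = [("load", True), ("add", True), ("store", False)]
    print(straight_line_reliability(block, profile))  # ~0.999994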

“One thing you can have in programs is different paths that are due to conditionals,” Misailovic says. “When we statically analyze the program, we want to make sure that we cover all the bases. When you get the variability for a function, this will be the variability of the least-reliable path.”
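Since a static analysis cannot know in advance which branch will execute, a conservative bound multiplies the reliability of the condition by that of the least-reliable branch. Again, this is an illustrative sketch rather than Rely’s actual analysis:

    def conditional_reliability(condition_ops, then_ops, else_ops, profile):
        """Conservative reliability bound for an if/else: the condition's
        reliability times that of the least-reliable branch."""
        def path(ops):
            r = 1.0
            for op, approximate in ops:
                r *= profile[op] if approximate else 1.0
            return r

        return path(condition_ops) * min(path(then_ops), path(else_ops))

    profile = {"cmp": 0.999999, "add": 0.999999, "mul": 0.999998}
    # if (x <. y) { total = total +. 1 } else { total = total *. 2 }
    print(conditional_reliability(
        condition_ops=[("cmp", True)],
        then_ops=[("add", True)],
        else_ops=[("mul", True)],
        profile=profile,
    ))  # worst case: the comparison followed by the approximate multiply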

“There’s a fair amount of sophisticated reasoning that has to go into this because of these kinds of factors,” Rinard adds. “It’s the difference between reasoning about any specific execution of the program, where you’ve just got one single trace, and all possible executions of the program.”

Trial runs

The researchers tested their system on several benchmark programs standard in the field, using a range of theoretically predicted failure rates. “We went through the literature and found the numbers that people claimed for existing designs,” Carbin says.

With the existing version of Rely, a programmer who finds that permitting a few errors yields an unacceptably low probability of success can go back and tinker with his or her code, removing dots here and there and adding them elsewhere. Re-evaluating the code, the researchers say, generally takes no more than a few seconds.

But in ongoing work, they’re trying to develop a version of the system that allows the programmer to simply specify the accepted failure rate for whole blocks of code: say, pixels in a frame of video need to be decoded with 97 percent reliability. The system would then go through and automatically determine how the code should be modified to both meet those requirements and maximize either power savings or speed of execution.

“This is a foundation result, if you will,” says Dan Grossman, an associate professor of computer science and engineering at the University of Washington. “This explains how to connect the mathematics behind reliability to the languages that we would use to write code in an unreliable environment.”

Grossman believes that for some applications, at least, it’s likely that chipmakers will move to unreliable components in the near future. “The increased efficiency in the hardware is very, very tempting,” Grossman says. “We need software work like this in order to make that hardware usable for software developers.”


Topics: Computer Science and Artificial Intelligence Laboratory (CSAIL), Computing, Program verification, Computer chips

Comments

From a product perspective, I wonder who in their right mind will use components which are deemed to fail soon. If anything, it will only encourage people to make expensive and faulty chips and milk customers when they want to replace them. It will also possibly add to the e-waste problem out there. It's not innovative, it's draconian. If you're doing probabilities, try to work on improving the probability of better chips in the fab.

We already have unreliable chips: they're called NAND flash, and there are ways to detect bad blocks and mark them as such.

However, we are probably talking about CPUs here, so I wonder how we'd handle it if the CPU returned 2 + 2 = 5 in a somewhat random manner. That probably means some part of the chip should be 100 percent accurate, whereas other parts may be allowed to return inaccurate results. Maybe in an SoC the CPU would be accurate but the GPU not so, without bad consequences. Except that people are starting to use GPUs for computing now...

As the article wisely states, this is a matter of requirements: the FAA, FDA, NHTSA, and DoD will approve neither unreliable hardware nor unreliable software. You do not want to step on your car's brakes only to find that they do not work, whatever the reason might be.

The article did not mention what might cause hardware to become inaccurate. It implied that "overclocking" might be the reason, but this is not accurate, to say the least, and a higher frequency means more power consumption, not less, as implied in the article.

The video example is off the mark for many reasons; this is not the place to spell them out, but look up BER, DSP, keyword detection, etc.

It might be wise to distinguish between "fuzzy logic" systems, i.e., systems whose return values are not guaranteed, and ... a bug, like the one that forced Intel to recall and replace CPUs over a one-in-a-billion error.

The issue is not new - the design of processors intended to be exposed to radiation takes into account changes in the silicon logic.
