Automatic bug repair

System fixes bugs by importing functionality from other programs — without access to source code.




At the Association for Computing Machinery’s Programming Language Design and Implementation conference this month, MIT researchers presented a new system that repairs dangerous software bugs by automatically importing functionality from other, more secure applications.

Remarkably, the system, dubbed CodePhage, doesn’t require access to the source code of the applications whose functionality it’s borrowing. Instead, it analyzes the applications’ execution and characterizes the types of security checks they perform. As a consequence, it can import checks from applications written in programming languages other than the one in which the program it’s repairing was written.

Once it’s imported code into a vulnerable application, CodePhage can provide a further layer of analysis that guarantees that the bug has been repaired.

“We have tons of source code available in open-source repositories, millions of projects, and a lot of these projects implement similar specifications,” says Stelios Sidiroglou-Douskos, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) who led the development of CodePhage. “Even though that might not be the core functionality of the program, they frequently have subcomponents that share functionality across a large number of projects.”

With CodePhage, he says, “over time, what you’d be doing is building this hybrid system that takes the best components from all these implementations.”

Sidiroglou-Douskos and his coauthors — MIT professor of computer science and engineering Martin Rinard, graduate student Fan Long, and Eric Lahtinen, a researcher in Rinard’s group — refer to the program CodePhage is repairing as the “recipient” and the program whose functionality it’s borrowing as the “donor.” To begin its analysis, CodePhage requires two sample inputs: one that causes the recipient to crash and one that doesn’t. A bug-locating program that the same group reported in March, dubbed DIODE, generates crash-inducing inputs automatically. But a user may simply have found that trying to open a particular file caused a crash.

Carrying the past

First, CodePhage feeds the “safe” input — the one that doesn’t induce crashes — to the donor. It then tracks the sequence of operations the donor executes and records them using a symbolic expression, a string of symbols that describes the logical constraints the operations impose.

At some point, for instance, the donor may check to see whether the size of the input is below some threshold. If it is, CodePhage will add a term to its growing symbolic expression that represents the condition of being below that threshold. It doesn’t record the actual size of the file — just the constraint imposed by the check.
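As a rough illustration of this recording step, and not CodePhage's actual implementation, the Python sketch below shows a toy donor whose checks are logged as constraints. The `Constraint` class, the `trace_donor` function, the field names and the thresholds are all hypothetical; the real system observes the execution of applications it has no source code for, rather than cooperating Python functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    field: str   # symbolic name of an input field (hypothetical)
    op: str      # comparison operator, "<" or ">"
    bound: int   # constant the field is compared against
    holds: bool  # whether the check passed on this run


def trace_donor(inp, trace):
    """Toy donor: appends every check it performs on the input to `trace`."""
    def check(field, op, bound):
        value = inp[field]
        result = value < bound if op == "<" else value > bound
        # Only the constraint is recorded, never the concrete value.
        trace.append(Constraint(field, op, bound, result))
        return result

    if not check("width", ">", 0):
        return "rejected"
    if not check("size", "<", 1 << 20):
        return "rejected"
    return "accepted"


safe_trace = []
status = trace_donor({"width": 64, "size": 4096}, safe_trace)
```

The trace holds the condition "size is below the threshold" but not the size 4096 itself, which mirrors the distinction the article draws.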

Next, CodePhage feeds the donor the crash-inducing input. Again, it builds up a symbolic expression that represents the operations the donor performs. When the new symbolic expression diverges from the old one, however, CodePhage interrupts the process. The divergence represents a constraint that the safe input met and the crash-inducing input does not. As such, it could be a security check missing from the recipient.
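The divergence step can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each trace is a list of hypothetical `(condition, outcome)` pairs, and `first_divergence` is an illustrative helper, not a function from CodePhage.

```python
def first_divergence(safe_trace, crash_trace):
    """Return (index, condition) of the first check whose outcome differs
    between the two runs, or None if the compared prefixes agree."""
    for i, (safe_step, crash_step) in enumerate(zip(safe_trace, crash_trace)):
        if safe_step != crash_step:
            # The safe input satisfied this condition; the crashing one did not.
            return i, safe_step[0]
    return None


# Hypothetical traces: the donor rejects the crashing input at its second check.
safe = [("width > 0", True), ("size < 1048576", True), ("depth < 32", True)]
crash = [("width > 0", True), ("size < 1048576", False)]

divergence = first_divergence(safe, crash)
```

Here the returned condition, `size < 1048576`, is the candidate check that may be missing from the recipient.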

CodePhage then analyzes the recipient to find locations at which the input meets most, but not quite all, of the constraints described by the new symbolic expression. The recipient may perform different operations in a different order than the donor does, and it may store data in different forms. But the symbolic expression describes the state of the data after it’s been processed, not the processing itself.

At each of the locations it identifies, CodePhage can dispense with most of the constraints described by the symbolic expression — the constraints that the recipient, too, imposes. Starting with the first location, it translates the few constraints that remain into the language of the recipient and inserts them into the source code. Then it runs the recipient again, using the crash-inducing input.
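A minimal sketch of this filtering and translation step follows, assuming constraints are plain strings and the patch is rendered as a Python guard. The function names, the example constraints and the choice of Python as the recipient's language are assumptions made for illustration only.

```python
def missing_checks(donor_constraints, recipient_constraints):
    """Keep only the donor constraints the recipient does not already impose."""
    return [c for c in donor_constraints if c not in recipient_constraints]


def render_guard(constraints):
    """Translate the remaining constraints into a guard statement (here: Python)."""
    condition = " and ".join(f"({c})" for c in constraints)
    return f"if not ({condition}): raise ValueError('rejected input')"


# Hypothetical constraint sets at one candidate location in the recipient.
donor = ["width > 0", "height > 0", "width * height < 2**31"]
recipient = {"width > 0", "height > 0"}

remaining = missing_checks(donor, recipient)
patch = render_guard(remaining)
```

Only the one constraint the recipient lacks ends up in the generated guard; the two it already enforces are dropped.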

If the program holds up, the new code has solved the problem. If it doesn’t, CodePhage moves on to the next candidate location in the recipient. If the program is still crashing, even after CodePhage has tried repairs at all the candidate locations, it returns to the donor program and continues building up its symbolic expression, until it arrives at another point of divergence.
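The overall search described above can be summarized as a nested loop. The sketch below is a simplified model, not the researchers' code: `repair`, `candidate_locations` and `crashes_with_patch` are hypothetical names, and the two callables stand in for the analysis of the recipient and for rerunning it on the crash-inducing input.

```python
def repair(divergences, candidate_locations, crashes_with_patch):
    """Try each divergent check at each candidate location until the
    recipient survives the crash-inducing input."""
    for check in divergences:                 # successive divergence points in the donor
        for location in candidate_locations(check):
            if not crashes_with_patch(location, check):
                return location, check        # the patched recipient holds up
    return None                               # no repair found


# Toy stand-ins for the real analyses (all values are made up).
attempts = []

def candidate_locations(check):
    return ["parse_header", "alloc_buffer"]

def crashes_with_patch(location, check):
    attempts.append((location, check))
    return (location, check) != ("alloc_buffer", "width * height < 2**31")

result = repair(
    ["size < 1048576", "width * height < 2**31"],
    candidate_locations,
    crashes_with_patch,
)
```

In this toy run the first check fails at both locations, so the loop moves on to the next divergence and succeeds on its second candidate location.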

Automated future

The researchers tested CodePhage on seven common open-source programs in which DIODE had found bugs, importing repairs from between two and four donors for each. In all instances, CodePhage was able to patch up the vulnerable code, and it generally took between two and 10 minutes per repair.

As the researchers explain, in modern commercial software, security checks can take up 80 percent of the code — or even more. One of their hopes is that future versions of CodePhage could drastically reduce the time that software developers spend on grunt work, by automating those checks’ insertion.

“The longer-term vision is that you never have to write a piece of code that somebody else has written before,” Rinard says. “The system finds that piece of code and automatically puts it together with whatever pieces of code you need to make your program work.”

“The technique of borrowing code from another program that has similar functionality, and being able to take a program that essentially is broken and fix it in that manner, is a pretty cool result,” says Emery Berger, a professor of computer science at the University of Massachusetts at Amherst. “To be honest, I was surprised that it worked at all.”

“The donor program was not written by the same people,” Berger explains. “They have different coding standards; they name variables differently; they use all kinds of different variables; the variables could be local; or they could be higher up in the stack. And CodePhage is able to identify these connections and say, ‘These variables correlate to these variables.’ Speaking in terms of organ donation, it transforms that code to make it a perfect graft, as if it had been written that way in the beginning. The fact that it works as well as it does is surprising — and cool.”


Topics: Research, School of Engineering, Computer science and technology, Computer Science and Artificial Intelligence Laboratory (CSAIL), Electrical engineering and computer science (EECS)

Comments

Incredible, just like magic! I wonder if it's available for public use?

Cool! Do we have any chance to try it?

cool

That sounds really cool :)

Could this be used for malicious purposes? Like can this same type of program be used to find vulnerabilities and exploit them? Like, say, code in anti-virus software? :)

So does it patch only when you have precompiled source code? Or can it patch compiled binaries?

biology analog? ... trading sequences of DNA over generations?

Well that's it... they just invented SkyNet as far as I am concerned. :-P

Will it be open-sourced soon?

Can I try any version to check the results? It sounds very powerful.

Probably a recipe for disaster. Code can predict what you intended to do?

Currently a lot of research is going on in this area to find out the effects of changes on the whole system. This will definitely help people discover new patterns of refactoring...

Cool, can we try it out?

They could use a similar approach to database queries where all of the possible queries were already made ahead of time, therefore it's easier to review the value of those queries internally, externally.....We could also use that to further irrigate memory allocations like water in a field by putting certain memories in certain places...We can rethink memory allocation to trade some efficiency for solid security........AND perhaps we match them to shortcuts without a brain crunch to figure out potential queries........Better hardware will be entirely more dynamic and perhaps codependent rather than stand alone. I envisioned as one answer to passwords getting broke easily as a blend of analog, digital to make a password easier to remember, more difficult to break by assigning weights to characters, putting those weight references in your pocket for quick reference....probably could 3D print such a device and change the weights less frequently than a password...somewhere in a computer there is a place for language dialect to be used effectively...password attempts probably can be inverted...let the burglar in and give them the opportunity to make bad commands first, set a high bar for correct commands before they are actually executed, avoid a lockdown, reset...partner debugging could backfire in reverse though.

This is the end-times for code analysis based on comments. If "crashed" can be defined as any arbitrary state, the donor analysis trigger becomes "reaches that state (or not)" where any given input case result might be sought OR avoided... This will almost immediately be paired up with goal-seeking genetic algorithms, I suspect.

In what programming language is it written?

In what language is it written?

The future of "coding" has a very short lifespan in front of it. Software development will evolve way past "coding" the same way that nearly no one interacts with assembly moving bits around registers. The majority of all code blocks already exist and it all comes down to sequencing and managing a software's timeline.

Alternate intelligence as a whole grants sustenance to those with small or no imagination.
