Realistic animation of human face makes simulated talking look real

This image illustrates how MIT researchers artificially animate a human face. The top row is a real background sequence of Mary 101 recorded by the researchers. The middle row is an animation of the synthetic mouth generated in the lab. The bottom row shows the synthetic mouth animation superimposed on the background sequence.
Image courtesy / Tony F. Ezzat

Mary 101's face belongs to a real person, but her image is now a video ventriloquist's dummy. MIT researchers Tomaso Poggio and Tony F. Ezzat can make her say anything they want.

To date, artificially animated human faces have looked jerky and unrealistic. Poggio, an investigator with the McGovern Institute for Brain Research, and Ezzat, a graduate student in electrical engineering and computer science, have simulated mouth movements that look so real, most viewers can't tell that Mary 101 isn't an ordinary videotape of a person speaking.

Given a few minutes of footage of any individual, the researchers can pair virtually any audio with any videotaped face, matching mouth movements to the words.

Poggio can imagine a future in which a celebrity such as Michael Jordan may sell his image and the right to create a virtual video version of himself for advertising and other purposes. Or maybe the estates of John Wayne, Marilyn Monroe or Elvis Presley would be willing to have the performers make a virtual comeback--for a price.

A more realistic scenario for the near future would be webcasts or TV newscasts incorporating the face of a model like Mary 101 that has been programmed to give weather updates or read the day's winning lottery number. At its present level of development, "You wouldn't want to look at this (face) for too long," Ezzat says. While the mouth is extremely lifelike, the overall effect tends to look emotionless after a couple of minutes because eye and head movements don't exactly match the words. (The researchers are working on that.)

"This (human animation) technique is inevitable--it's just another step in progress that has happened over the last several years," said Poggio, the Uncas and Helen Whitaker Professor of Vision Sciences and Biophysics in the Department of Brain and Cognitive Sciences.

Poggio, who also is affiliated with the Artificial Intelligence Laboratory and is co-director of the Center for Biological and Computational Learning, investigates learning theories that can be applied to understanding the brain and building intelligent machines. He is interested in developing techniques that allow computers to learn from experience. Applications range from classifying text to analyzing genetic data and creating artificial financial markets.

Supervised learning, or computers learning from examples, has not previously been applied to computer graphics. "We're testing a new paradigm in learning, which is novel for computer graphics. No one has tried this before," Poggio said.


Poggio says that, among other things, this work could improve the man-machine interface by putting a "real" face on computer avatars. Instead of the unrealistic, cartoon-like images that now exist, computerized people could become much more lifelike. This has applications in business, entertainment, speech therapy and foreign-language teaching through a computerized tutor.

Even as computer bandwidth rapidly increases, one possible use is video email and videophones that work over relatively low bandwidths. You could use existing video of a person's face, transmit his or her voice alone and use computer animation techniques to have the mouth shape the new words.

This method could be used for redubbing a film from one language to another, negating the need for subtitles. It would also avoid Japanese film syndrome where the actors' lips are still moving long after the shorter dubbed English phrase has been uttered.

The technique would also be useful for tasks such as eye-tracking, facial expression recognition, visual speech estimation and pose estimation.

Poggio has applied similar computer-learning techniques to automobiles that can automatically detect and protect pedestrians; face recognition systems; man-machine expression; artificial financial markets, a tool that helps understand customers without conducting large market surveys; and the classification of genetic data. For example, certain cancers are difficult for pathologists to identify but have a known genetic signature, so the computer can check the patient's DNA to pinpoint the type of cancer.


Since Ezzat graduated from MIT five years ago with undergraduate and master's degrees in electrical engineering and computer science, he has been grappling with the problem of how to create a realistic computer model of a real human face.

Unlike previous efforts that relied on 3D computer modeling techniques, Ezzat and Poggio built a computer model that uses example video footage of the person talking.

For the Mary 101 project, Ezzat used the facilities of MIT Video Production Services to videotape model Mary 101 speaking for eight minutes. He gathered 15,000 individual digitized images of her face. He then developed software that allowed the computer to automatically choose a small set of images that covered the range of Mary 101's mouth movements.
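
The article does not say how the software picks its small set of representative images. As a rough illustration only, the sketch below uses greedy farthest-point sampling, a swapped-in technique and not necessarily what Ezzat's software does. It assumes each frame has already been reduced to a short feature vector describing the mouth; the function names and the random stand-in data are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_prototypes(frames, k):
    """Greedy farthest-point sampling (an assumed stand-in method).

    Returns the indices of k frames chosen so that they spread out
    over the observed range of mouth shapes.
    """
    if not 0 < k <= len(frames):
        raise ValueError("k must be between 1 and the number of frames")
    chosen = [0]  # arbitrary starting frame
    # Distance from every frame to its nearest chosen prototype so far.
    nearest = [distance(f, frames[0]) for f in frames]
    while len(chosen) < k:
        # The next prototype is the frame that is worst covered so far.
        idx = max(range(len(frames)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, distance(f, frames[idx]))
                   for d, f in zip(nearest, frames)]
    return chosen


# Hypothetical stand-in data: 200 frames, each reduced to two mouth features.
rng = random.Random(0)
frames = [(rng.random(), rng.random()) for _ in range(200)]
prototypes = select_prototypes(frames, 8)
print(prototypes)
```

Each pass adds the frame that lies farthest from everything already chosen, so unusual mouth shapes are picked up early instead of being crowded out by the many near-identical frames of a closed or neutral mouth.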

The computer takes these mouth images and recombines them in a high-dimensional "morph space." Using a learning algorithm in the morph space, the computer is able to figure out from the original video footage how Mary 101's face moves. This allows the software to re-synthesize new utterances.
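
To make the idea of a "morph space" concrete, here is a minimal sketch in which a point in the space is simply a list of weights, one per example mouth image, and the synthesized image is the weighted blend of those examples. This is a deliberate simplification: a real morph would also warp image geometry, while this sketch only blends pixel values. The names `blend` and `interpolate` and the tiny four-pixel "images" are hypothetical.

```python
def blend(prototypes, weights):
    """Blend prototype images (flat lists of pixel values) by weight.

    The weights are normalized to sum to 1, so the result stays within
    the range spanned by the prototypes when all weights are non-negative.
    """
    if len(prototypes) != len(weights):
        raise ValueError("need exactly one weight per prototype")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    norm = [w / total for w in weights]
    n_pixels = len(prototypes[0])
    return [sum(w * p[i] for w, p in zip(norm, prototypes))
            for i in range(n_pixels)]


def interpolate(w_start, w_end, steps):
    """Straight-line path between two points in the (assumed) morph space."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for s in range(steps):
        t = s / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(w_start, w_end)])
    return path


# Two hypothetical four-pixel "mouth images": closed and open.
closed_mouth = [0, 0, 0, 0]
open_mouth = [10, 20, 30, 40]

# Move from fully closed to fully open in three frames.
frames = [blend([closed_mouth, open_mouth], w)
          for w in interpolate([1, 0], [0, 1], 3)]
print(frames)
```

In this picture, a new utterance corresponds to a trajectory of weight vectors through the space, and rendering each point on the trajectory yields one video frame.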

"The work is still in its infancy, but it proves the point that we can take existing video footage and re-animate it in interesting ways," Ezzat said. "At this point, we can re-animate realistic speech, but we still need to work on re-animating emotions. In addition, we cannot handle footage with profile views of a person, but we are making progress toward addressing these issues."

The video produced by the system was evaluated using a special test to see whether human subjects could distinguish between real sequences and synthetic ones. Gadi Geiger, a researcher in the Center for Biological and Computational Learning who is working with Ezzat, showed that people could not distinguish the real video from the synthetic sequences generated by the computer.
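
The article does not describe how the test results were analyzed. One common way to check a claim like "people could not distinguish real from synthetic" is to ask whether viewers' accuracy differs from coin-flipping, and the sketch below does that with an exact two-sided binomial test. The function name and the counts in the example are made up for illustration and are not the study's data.

```python
from math import comb


def binomial_two_sided_p(correct, trials, chance=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than
    the observed count, assuming each guess is right with probability
    `chance`.
    """
    pmf = [comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[correct]
    # Small tolerance so floating-point ties count as equally likely.
    p = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical numbers, NOT from the study: 54 correct answers in 100 trials.
p_value = binomial_two_sided_p(54, 100)
print(f"p = {p_value:.3f}")
```

A large p-value means the observed accuracy is consistent with guessing, which is the outcome a convincing synthetic video should produce. A small one would mean viewers were reliably telling the two apart.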

Although the researchers work only with video, using audio supplied from different sources, they are starting a collaboration with MIT's Spoken Language Systems Group in the Laboratory for Computer Science, in which they will tackle other issues as well. One of the goals of the project, beyond entertainment and virtual actors, is to develop a computerized tutor that can interact in a natural way with its pupils. This is part of the broader agenda of the McGovern Institute that includes studying and improving communication and understanding ideas and feelings.

Ultimately, Poggio is interested in furthering understanding of the human brain and creating machines that think more like people.

"Understanding the problem of learning in the brain is key if you want to reproduce intelligence in machines. This is not just understanding the mechanisms of memory storage, but understanding the processes that allow us to learn complex strategies and perceptions," he said. "Learning is arguably at the very core of the problem of intelligence, both biological and artificial. Because seeing is intelligence, this work also is key to understanding the brain's visual cortex."

This work is funded by the National Science Foundation and NTT through the NTT-MIT Research Collaboration.

A version of this article appeared in MIT Tech Talk on May 22, 2002.
