CAMBRIDGE, Mass.--What do a gothic cathedral and a cactus have in common?
Not much, at first glance. But when a computer tangles with the enormously complex task of identifying images, it can be fooled by the grayish-green color, spiky protrusions and vertical orientation of cathedrals and cactuses.
With the help of two researchers at the MIT Artificial Intelligence Laboratory, computers are becoming better at finding a match for images. This ability may someday make it possible to plug into an Internet search engine and say, "Find me a sunset," or "Give me Clark Gable," or "Find me a picture that looks like this one."
Eric L. Grimson, professor of electrical engineering and computer science, and Paul Viola, assistant professor of electrical engineering and computer science, focus on computer vision and machine learning using techniques from statistics and information theory.
While their primary focus is to further our understanding of how humans and machines "see," they and others have come up with many possible uses for the technology.
We might be able to quickly peruse real estate listings for a certain kind of house; look through catalogs for just the right couch, suit or wallpaper; search photo archives for the picture of the girl screaming over the body at Kent State; or compare the physical structures of proteins used in the biotechnology industry.
Grimson is working with the U.S. Patent and Trademark Office on a way to bypass the tedious process of hand-checking trademark applications against drawerfuls of existing trademarks.
Fishing in a sea of images
There are an estimated 50 to 100 million images on the web. An additional 17 million are being added as huge historical archives of news photos are digitized and made available to Internet users.
Right now, you can't say to your average computer, "Find me a gothic cathedral" or "I need that photo of the flag being raised on Iwo Jima."
"The key is to figure out how people see images and how we can model that," Viola said.
Viola and Grimson approach the task from different angles. Viola is interested in mimicking how the human brain processes images.
The brain's initial processing of an image happens in the visual cortex at the back of the brain. There, the brain decides in broad strokes whether the image is brighter on the top or bottom, to the right or left. The more complex decisions about colors and features come into play in a different part of the brain, where cells are responsive to very complex spatial relations.
Similarly, Viola's approach is based on the idea that if two images are correctly matched, they have a lot of information in common. When you give the computer two similar images and ask it to find matches, it uses the correlating aspects of the two examples.
Viola has the computer scan two or three similar images to determine areas of brightness, orientation, shape and color. It has a total of 49,000 "tricks" that it can potentially use to classify a picture. A sunset, for instance, would look like an area of brightness over a darker region, a bright spot of sun striped by clouds.
It checks its database of 5,000 images for matches that apply to both of the given images. It narrows its scope to a few hundred tricks, which it then uses to pick about 20 matches.
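The narrowing-and-ranking idea can be illustrated with a short sketch. This is not the researchers' code: the function names, the scoring rule and the toy numbers are all assumptions made for illustration. Each image is reduced to a list of feature responses (standing in for the "tricks"), the features on which the two example images respond strongly and agree are kept, and the database is ranked by closeness on those features alone.

```python
# Illustrative sketch only. The names, the scoring rule and the data are
# assumptions; the article does not describe the actual implementation.

def select_features(example_a, example_b, keep):
    """Return indices of the `keep` features that both examples share most strongly."""
    scored = []
    for i, (a, b) in enumerate(zip(example_a, example_b)):
        # High when both responses are large and close to each other.
        score = min(a, b) - abs(a - b)
        scored.append((score, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:keep]]


def rank_matches(database, example_a, example_b, keep_features, top_k):
    """Rank database images by distance to the examples on the selected features."""
    features = select_features(example_a, example_b, keep_features)
    # Compare each database image with the average of the two examples.
    target = {i: (example_a[i] + example_b[i]) / 2 for i in features}

    def distance(vector):
        return sum((vector[i] - target[i]) ** 2 for i in features)

    ranked = sorted(database, key=lambda name: distance(database[name]))
    return ranked[:top_k]


# Toy database: four images, four hypothetical feature responses each.
toy_database = {
    "car_1": [0.9, 0.8, 0.1, 0.2],
    "car_2": [0.8, 0.9, 0.2, 0.7],
    "sunset": [0.1, 0.2, 0.9, 0.5],
    "plane": [0.7, 0.6, 0.3, 0.9],
}
query_a = [0.9, 0.9, 0.1, 0.1]
query_b = [0.8, 0.9, 0.2, 0.9]

print(rank_matches(toy_database, query_a, query_b, keep_features=2, top_k=3))
# prints ['car_2', 'car_1', 'plane']
```

In this toy run the two queries disagree on the last feature, so it is discarded, and the plane ranks just behind the cars because it resembles them on the features that were kept.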
Its picks can be amazingly close. If the first two images are cars, most of the pictures you get back are cars. A few might be airplanes (they both have elongated bodies and wheels) and a couple are seemingly unrelated, like a sunset. On closer look, though, Viola can usually tell you how the computer decided on even the poor matches.
"It's not unlike working with a text search engine," Viola points out. "Some matches are exactly what you want and others aren't even close. It's fun to see what it comes up with."
If written descriptions existed for every image, you could use words to search the Internet or a database for what you want. But there's no guarantee that one person would use the same word or words to describe a picture as another person.
"Our goal is to capture what's in an image that really describes the content to find images that are similar to it," Grimson said.
His trademark program is remarkably successful. If you give it the CBS eye logo, it will come back with eyes as well as other images that contain the same basic stylized elements.
If you give it a fancy letter "B," it will find similar letters. Its strength also lies in finding other B-shaped designs that a person may not characterize as letters. For instance, its "B" search turns up an image in which the top half of the shape is a face in profile. The computer is good at matching images where the look is the same, while the content is not.
"You can even give it sketches," Grimson said.
His program is based on studies of human perception of images. Identifying images is something people do exceptionally well. We can look at blurred images, tiny images, distorted images and pictures taken in different light and still identify them.
To get the computer to mimic this ability, Grimson has it look for segments with something in common, like color or brightness. He then has it compare each of these commonalities to images in the database. The more the computer has to warp its parameters to match a new image, the less accurate the match.
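The link between warping and match quality can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Grimson's program: each segment is reduced to a hypothetical (x, y, brightness) triple, the segments of the two images are assumed to be already paired up in order, and the score formula is invented for the example.

```python
import math

# Illustrative sketch only. The segment representation and the score
# formula are assumptions, not the method used in the trademark program.

def warp_cost(template, candidate):
    """Sum how far each template segment must move and change to fit the candidate."""
    cost = 0.0
    for (tx, ty, tb), (cx, cy, cb) in zip(template, candidate):
        shift = math.hypot(cx - tx, cy - ty)   # positional warp
        change = abs(cb - tb)                  # brightness warp
        cost += shift + change
    return cost


def match_score(template, candidate):
    """Map warp cost to a score in (0, 1]; more warping gives a lower score."""
    return 1.0 / (1.0 + warp_cost(template, candidate))


template = [(0.0, 0.0, 0.5), (3.0, 4.0, 0.5)]
identical = [(0.0, 0.0, 0.5), (3.0, 4.0, 0.5)]
distorted = [(0.0, 0.0, 0.5), (0.0, 0.0, 0.5)]

print(match_score(template, identical))   # no warp needed, so the score is 1.0
print(match_score(template, distorted))   # one segment moved 5 units, so the score drops
```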
This same system can be used successfully with faces. Grimson and Viola also are beginning to work on applying their techniques to video sequences.
Viola and Grimson say there are ways of "teaching" the computer to improve its success rate. If you showed it pictures of cathedrals, for instance, and labeled them in the machine's memory, it would be better at picking out a picture of a cathedral. These computers are capable of learning because they can improve their performance by getting positive feedback for a good match.
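One simple way to picture learning from feedback is a set of feature weights that shift each time a person approves or rejects a match. The sketch below is an assumption for illustration only: the update rule, the learning rate and the requirement that feature values lie between 0 and 1 are not taken from the researchers' work.

```python
# Illustrative sketch only. The update rule and parameter names are
# assumptions; the article does not say how the systems learn.

def update_weights(weights, query, match, liked, rate=0.1):
    """Nudge feature weights after a person judges one returned match.

    Features on which the query and the match agree gain weight when the
    match was liked and lose weight when it was rejected.
    Feature values are assumed to lie in the range 0 to 1.
    """
    sign = 1.0 if liked else -1.0
    updated = []
    for w, q, m in zip(weights, query, match):
        agreement = 1.0 - abs(q - m)  # 1.0 when the two responses are identical
        updated.append(max(0.0, w + sign * rate * agreement))
    return updated


weights = [1.0, 1.0]
query = [0.9, 0.1]
match = [0.9, 0.9]

# After positive feedback, the feature the two images share gains the most weight.
print(update_weights(weights, query, match, liked=True))
```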
The Learning and Vision Group at the MIT Artificial Intelligence Lab focuses on issues and applications in machine learning. The group applies learning methods to problems in vision that are applicable to a wide variety of domains, including information retrieval, event prediction, feature discovery, coincidence detection, function learning and optimization, as well as vision-related problems such as image alignment, object recognition and object tracking.
Professors Grimson and Viola's work is funded primarily through the Office of Naval Research and DARPA.