What do a Gothic cathedral and a cactus have in common?
Not much, at first glance. But when a computer confronts the enormously complex task of identifying images, it can be fooled by the grayish-green color, spiky protrusions and vertical orientation of these two objects.
With the help of two researchers at the Artificial Intelligence Laboratory, computers are becoming better at finding matches for images. This ability may someday make it possible to plug into an Internet search engine and say, "find me a sunset" or "give me Clark Gable" or "find me a picture that looks like this one."
Professor Eric L. Grimson and Assistant Professor Paul Viola of electrical engineering and computer science focus on computer vision and machine learning using techniques from statistics and information theory.
While their primary focus is to further our understanding of how humans and machines "see," they and others have come up with many possible uses for the technology. Users might be able to quickly peruse real estate listings for a certain kind of house; look through catalogs for just the right couch, suit or wallpaper; search photo archives for the picture of the girl screaming over the body at Kent State; or compare the physical structures of proteins used in the biotechnology industry.
Professor Grimson is working with the US Patent and Trademark Office on a way to bypass the tedious process of hand-checking trademark applications against drawerfuls of existing trademarks.
FISHING IN A SEA OF IMAGES
There are an estimated 50-100 million images on the web. An additional 17 million are being added as huge historical archives of news photos are digitized and made available to Internet users.
Right now, you can't say to your average computer, "Find me a Gothic cathedral" or "I need that photo of the flag being raised on Iwo Jima."
"The key is to figure out how people see images and how we can model that," Professor Viola said.
He and Professor Grimson approach the task from different angles. Professor Viola is interested in mimicking how the human brain processes images.
The brain's initial processing of an image happens in the visual cortex at the back of the brain. There, the brain decides in broad strokes whether the image is brighter on the top or bottom, on the right or left. The more complex decisions about colors and features come into play in a different part of the brain, where cells are responsive to very complex spatial relations.
Similarly, Professor Viola's approach is based on the idea that if two images are correctly matched, they have a lot of information in common. When you give the computer two similar images and ask it to find matches, it works from the aspects the two examples share.
Professor Viola has the computer scan two or three similar images to determine areas of brightness, orientation, shape and color; in all, it has 49,000 "tricks" it can potentially use to classify a picture. A sunset, for instance, would look like an area of brightness over a darker region, a bright spot of sun striped by clouds.
The computer then checks which of those tricks apply to both of the given images, narrowing its scope to a few hundred, which it uses to pick about 20 matches from its database of 5,000 images.
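In code, that pipeline might look something like the toy sketch below. It is a hypothetical reconstruction, not the lab's actual program: the "tricks" are reduced to simple brightness comparisons, the pool of 49,000 tricks and the database of 5,000 images are scaled down, and the images themselves are random arrays standing in for photographs.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-ins: each "image" is an 8x8 grayscale array. A real system
    # would work with photographs and far richer measurements.
    def random_image():
        return rng.random((8, 8))

    # A "trick" here is one simple binary test -- "is region A brighter than
    # region B?" -- standing in for the measurements of brightness,
    # orientation, shape and color described above.
    def make_trick():
        (r1, c1), (r2, c2) = rng.integers(0, 6, size=(2, 2))
        def trick(img):
            return img[r1:r1 + 3, c1:c1 + 3].mean() > img[r2:r2 + 3, c2:c2 + 3].mean()
        return trick

    tricks = [make_trick() for _ in range(2000)]     # scaled-down pool of tricks
    database = [random_image() for _ in range(500)]  # scaled-down image database

    def retrieve(query_a, query_b, k=20):
        # Keep only the tricks that respond the same way to both query images:
        # these capture what the pair of examples has in common.
        shared = [t for t in tricks if t(query_a) == t(query_b)]
        target = [t(query_a) for t in shared]
        # Score each database image by how many of the shared tricks it matches,
        # then return the top k scorers.
        scores = [sum(t(img) == v for t, v in zip(shared, target)) for img in database]
        return np.argsort(scores)[::-1][:k]

    print(retrieve(database[0], database[1]))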
The computer's picks can be amazingly close. If the first two images are cars, most of the pictures you get back are cars. A few might be airplanes (they both have elongated bodies and wheels) and a couple are seemingly unrelated, like a sunset. A closer look, though, can usually tell you how the computer decided on even the poor matches.
"It's not unlike working with a text search engine," Professor Viola said. "Some matches are exactly what you want and others aren't even close. It's fun to see what it comes up with."
FINE-TUNING PERCEPTION
If written descriptions existed for every image, a user could use words to search the Internet or a database. But there's no guarantee that two people would use the same word or words to describe the same picture.
"Our goal is to capture what's in an image that really describes the content to find images that are similar to it," Professor Grimson said.
His trademark program is remarkably successful. If a user gives it the CBS eye logo, it will come back with eyes as well as other images that contain the same basic stylized elements.
If the user gives it a fancy letter "B," it will find similar letters. Its strength also lies in finding other B-shaped designs that a person may not characterize as letters. For instance, its "B" search turns up an image in which the top half of the shape is a face in profile. The computer is good at matching images where the look is the same while the content is not.
"You can even give it sketches," Professor Grimson said.
His program is based on studies on human perception of images. Identifying images is something people do exceptionally well. We can look at blurred images, tiny images, distorted images or pictures taken in different light and still identify them.
To get the computer to mimic this ability, Professor Grimson has it look for segments with something in common, like color or brightness. He then has it compare each of these commonalities to images in the database. The more the program has to warp its parameters to match a new image, the less accurate the match.
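The warp-and-compare idea can be sketched the same way. Again, this is a hypothetical toy, not Professor Grimson's program: each image is split into just two segments (bright and dark), each segment is described by two made-up parameters (its mean brightness and the fraction of the image it covers), and the "warp cost" is simply how far those parameters must move to line up with a stored image.

    import numpy as np

    def segment_params(img, threshold=0.5):
        # Split a grayscale image into a bright and a dark segment, describing
        # each by (mean brightness, fraction of the image it covers).
        params = []
        bright = img >= threshold
        for mask in (bright, ~bright):
            if mask.any():
                params.append((img[mask].mean(), mask.mean()))
        return np.array(params)

    def warp_cost(query_params, candidate_params):
        # Total parameter change needed to line the segments up: the more the
        # parameters must be warped, the less accurate the match.
        n = min(len(query_params), len(candidate_params))
        return float(np.abs(query_params[:n] - candidate_params[:n]).sum())

    rng = np.random.default_rng(1)
    database = [rng.random((16, 16)) for _ in range(100)]
    query = database[42] + rng.normal(0.0, 0.02, size=(16, 16))  # slightly altered copy

    costs = [warp_cost(segment_params(query), segment_params(img)) for img in database]
    print("best match:", int(np.argmin(costs)))  # expected: image 42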
This same system can be used successfully with faces. Professors Grimson and Viola also are beginning to work on applying their techniques to video sequences.
They say there are ways of "teaching" the computer to improve its success rate. If you showed it pictures of cathedrals, for instance, and labeled them in the machine's memory, it would be better at picking out a picture of a cathedral. These computers are capable of learning because they can improve their performance by getting positive feedback for a good match.
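A toy version of that feedback loop, under the same kind of simplifying assumptions as the sketches above, might look like this: a handful of binary features each carry a weight, and every feature that agreed on a user-confirmed match is strengthened, so future searches lean on the features that have earned positive feedback.

    import numpy as np

    rng = np.random.default_rng(2)

    # Eight toy binary features: "is row r brighter than the image average?"
    features = [lambda img, r=r: img[r].mean() > img.mean() for r in range(8)]
    weights = np.ones(len(features))  # every feature starts out equally trusted

    def agreement(a, b):
        # Which features treat images a and b the same way?
        return np.array([f(a) == f(b) for f in features])

    def score(query, candidate):
        # Weighted count of the features on which the two images agree.
        return weights[agreement(query, candidate)].sum()

    def reinforce(query, confirmed_match, rate=0.25):
        # Positive feedback: a confirmed good match strengthens every feature
        # that agreed on it, so those features count for more next time.
        global weights
        weights = weights + rate * agreement(query, confirmed_match)

    a, b = rng.random((2, 8, 8))
    before = score(a, b)
    reinforce(a, b)                   # the user confirms b is a good match for a
    print(before, "->", score(a, b))  # the same pair now scores higher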
The Learning and Vision Group at the Artificial Intelligence Laboratory focuses on issues and applications in machine learning. Group members apply learning methods to problems in vision that are applicable to a wide variety of domains, including information retrieval, event prediction, feature discovery, coincidence detection, function learning and optimization, as well as vision-related problems such as image alignment, object recognition and object tracking.
Professors Grimson and Viola's work is funded primarily through the Office of Naval Research and the Advanced Research Projects Agency.
A version of this article appeared in MIT Tech Talk on March 11, 1998.