Improved Visual Search

Researchers are trying to make computers see as we do.

Neil Savage

May 25, 2006

Search engines work wonderfully when you want to find something in a long stretch of text. Just type in a word or phrase, and the computer quickly scans through a Web page or Word document and picks it out. But for a computer to do the same thing with an image – find a particular person or object somewhere in a video recording, for instance – is much more difficult. Whereas a human eye instantly distinguishes a tree from a cat, it’s a lot of work to teach a computer to do the same.

That challenge is being tackled by researchers at MIT’s Center for Biological and Computational Learning (CBCL), led by Tomaso Poggio, the Eugene McDermott Professor in the Brain Sciences and Human Behavior. Some students at the center are proposing software that could work, say, with surveillance cameras in an office building or military base, eliminating the need for a human to watch monitors or review videotapes. Other applications might automate computer editing of home movies, or sort and retrieve photos from a vast database of images. It might also be possible to train a computer to perform preliminary medical diagnoses based on an MRI or CT scan image.

But the work to make such exciting applications possible is daunting. “The fact that it seems so easy to do for a human is part of our greatest illusion,” says Stanley Bileschi, who this month earned his PhD in electrical engineering and computer science at the CBCL. Processing visual data is computationally complex, he says, noting that people use about 40 percent of their brains just on that task. There are many variables to take into account when identifying an object: color, lighting, spatial orientation, distance, and texture. And vision both stimulates and is influenced by other brain functions, such as memory and reasoning, which are not fully understood. “Evolution has spent four billion years developing vision,” Poggio says.

Scientists have traditionally used statistical learning systems to teach computers to recognize objects. In such systems, a scientist tells a machine that certain images are faces, then tells it that certain other images are not faces. The computer examines the images pixel by pixel to figure out, statistically, what the face images have in common that the nonface images do not.

For instance, it might learn that a set of pixels representing the brow is brighter than the pixels representing the pupils, and that the two sets are a standard distance apart. It might notice that the mouth tends to be horizontal, and that there is a sharp change in brightness where the head stops and the background begins. Once trained, it can look at new images and see how closely they match the rules.
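To make the idea concrete, here is a minimal sketch of learning from labeled pixels. It is not the method any particular research group uses: it assumes a simple nearest-centroid rule, tiny six-pixel "images" with made-up brightness values, and hypothetical function names (train, classify). The point is only that the program is told which examples are faces and which are not, and works out the per-pixel regularities itself.

```python
# Toy statistical learner over raw pixel brightness (0.0 = dark, 1.0 = bright).
# Assumption: nearest-centroid classification stands in for a real learning system.

def mean_image(images):
    """Average a list of equally sized images pixel by pixel."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]

def squared_distance(a, b):
    """Sum of squared per-pixel differences between two images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(positives, negatives):
    """Summarize what the face examples and the nonface examples each look like."""
    return {"face": mean_image(positives), "nonface": mean_image(negatives)}

def classify(model, image):
    """Label a new image by whichever class average it sits closer to."""
    return min(model, key=lambda label: squared_distance(model[label], image))

# Made-up 2x3 images, flattened: first three pixels = brow, last three = eye row.
faces = [
    [0.9, 0.8, 0.9, 0.2, 0.7, 0.2],
    [0.8, 0.9, 0.8, 0.1, 0.8, 0.3],
]
nonfaces = [
    [0.2, 0.3, 0.2, 0.8, 0.9, 0.8],
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
]

model = train(faces, nonfaces)
print(classify(model, [0.85, 0.9, 0.8, 0.2, 0.75, 0.25]))  # prints "face"
```

In this toy data, the learned "face" average has a bright brow above darker pupils, which is the kind of regularity described above.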

Although this method can achieve limited results, it does not represent the way the brain processes images. Neurophysiologists at the CBCL are studying how, exactly, the brain does its visual work. They note how each pixel in an image stimulates a photoreceptor in the eye, for instance, based on the pixel’s color value and brightness: each stimulus leads neurons to fire in a particular pattern.

The programmers make a mathematical model of those patterns, tracking which neurons fire (and how strongly) and which don’t. They tell the computer to reproduce the right pattern when it sees a particular pixel, and then they train the system with positive and negative examples of objects. This is a tree, and this is not.

But instead of learning about the objects themselves, the computer learns the neuron stimulation pattern for each type of object. (Essentially, it’s learning patterns of patterns: the patterns of neural reactions not just to pixels but to groupings of pixels.) Later, when it sees a new image of a tree, it will see how closely the resulting neuron pattern matches the ones produced by other tree images. Poggio says this is similar to the way a baby’s brain gets imprinted with visual information and learns about the world around it.
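The two-stage idea can be sketched in a few lines. This is an illustration under stated assumptions, not the CBCL model: the two "units" (one responding to left-right brightness changes, one to up-down changes), the use of the strongest response across the image, and the names unit_responses, learn_patterns, and match are all hypothetical. The sketch first turns each image into a small pattern of unit responses, then learns and compares those patterns rather than the pixels.

```python
# Stage 1: simulated "firing pattern" for an image (a list of rows of brightness values).
# Assumption: two hand-written edge units stand in for modeled neurons.
def unit_responses(image):
    vertical_edge = 0.0    # strongest left/right brightness change seen anywhere
    horizontal_edge = 0.0  # strongest up/down brightness change seen anywhere
    for r in range(len(image) - 1):
        for c in range(len(image[0]) - 1):
            vertical_edge = max(vertical_edge, abs(image[r][c] - image[r][c + 1]))
            horizontal_edge = max(horizontal_edge, abs(image[r][c] - image[r + 1][c]))
    return [vertical_edge, horizontal_edge]

# Stage 2: learn the average response pattern for each label.
def learn_patterns(examples):
    patterns = {}
    for label, images in examples.items():
        responses = [unit_responses(img) for img in images]
        patterns[label] = [sum(vals) / len(vals) for vals in zip(*responses)]
    return patterns

def match(patterns, image):
    """Return the label whose stored pattern is closest to this image's pattern."""
    seen = unit_responses(image)
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(patterns[label], seen))
    return min(patterns, key=distance)

# Made-up training data: a bright vertical "trunk" versus a bright horizontal band.
examples = {
    "tree": [
        [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
        [[0.1, 0.9, 0.1], [0.1, 0.9, 0.1], [0.1, 0.9, 0.1]],
    ],
    "not tree": [
        [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
        [[0.2, 0.2, 0.2], [0.8, 0.8, 0.8], [0.2, 0.2, 0.2]],
    ],
}
patterns = learn_patterns(examples)

# A new image with the trunk in a different column still yields a tree-like pattern.
shifted_trunk = [[0.2, 0.2, 1.0], [0.2, 0.2, 1.0], [0.2, 0.2, 1.0]]
print(match(patterns, shifted_trunk))  # prints "tree"
```

Because the comparison happens on response patterns, the shifted trunk matches even though its pixels line up poorly with the training images.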

The researchers applied standard tests to the system and found that it can detect people and cars in a street scene about 95 to 98 percent of the time, Bileschi says. The system doesn’t just identify objects; it can view stills or video and recognize an action. It might recognize running, for instance, based on how a leg is bent or how quickly a person shifts position from one frame to the next.
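The frame-to-frame cue can be illustrated with a very small sketch. Everything here is an assumption for illustration: the person's position is taken as already tracked (in meters along the street), the 3.0 meters-per-second cutoff is arbitrary, and the function names are hypothetical. A real system would combine many more cues, such as limb posture.

```python
def mean_speed(positions, fps):
    """Average speed in meters per second from per-frame positions in meters."""
    steps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(steps) / len(steps) * fps

def label_motion(positions, fps, running_threshold=3.0):
    """Call it 'running' when the average speed reaches the (assumed) threshold."""
    if mean_speed(positions, fps) >= running_threshold:
        return "running"
    return "not running"

# Made-up tracks sampled at 30 frames per second.
stroll = [0.0, 0.05, 0.10, 0.15]  # about 1.5 m/s
sprint = [0.0, 0.2, 0.4, 0.6]     # about 6 m/s
print(label_motion(stroll, fps=30), label_motion(sprint, fps=30))
```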

David Lowe, professor of computer science at the University of British Columbia, says many researchers in the field of object recognition had believed that limiting the computer to “biologically plausible” calculations would reduce the machine’s performance compared with approaches that didn’t limit the functions a programmer could include. “However, the latest work by this group has produced some of the best results on standard object classification experiments,” he says. Even Poggio was surprised: he’d thought vision was too poorly understood for researchers to successfully mimic the brain.

“These are first-rate people, and they have some of the best technology for learning and recognizing objects,” says Pietro Perona, a computer vision expert at the California Institute of Technology. But turning the technology into a marketable product, he says, will still take some work.

Bileschi and Lior Wolf, a postdoc in Poggio’s lab, recently presented their work to a conference at MIT’s Deshpande Center for Technological Innovation, along with students from MIT’s Sloan School of Management. They’re hoping to attract interest from someone who will help fund the development needed to take this research from the lab into the marketplace. Although he’s not sure what the best application of the technology would be, Bileschi feels certain that it’s mature enough to be marketable. “It’s no longer a neat toy that does something interesting most of the time,” he says. “It’s at the point where it does something useful almost all of the time.”
