Search engines work wonderfully when you want to find something in a long stretch of text. Just type in a word or phrase, and the computer quickly scans through a Web page or Word document and picks it out. But for a computer to do the same thing with an image – find a particular person or object somewhere in a video recording, for instance – is much more difficult. Whereas a human eye instantly distinguishes a tree from a cat, it’s a lot of work to teach a computer to do the same.
That challenge is being tackled by researchers at MIT’s Center for Biological and Computational Learning (CBCL), led by Tomaso Poggio, the Eugene McDermott Professor in the Brain Sciences and Human Behavior. Some students at the center are proposing software that could work, say, with surveillance cameras in an office building or military base, eliminating the need for a human to watch monitors or review videotapes. Other applications might automate computer editing of home movies, or sort and retrieve photos from a vast database of images. It might also be possible to train a computer to perform preliminary medical diagnoses based on an MRI or CT scan image.
But the work to make such exciting applications possible is daunting. “The fact that it seems so easy to do for a human is part of our greatest illusion,” says Stanley Bileschi, who this month earned his PhD in electrical engineering and computer science at the CBCL. Processing visual data is computationally complex, he says, noting that people use about 40 percent of their brains just on that task. There are many variables to take into account when identifying an object: color, lighting, spatial orientation, distance, and texture. And vision both stimulates and is influenced by other brain functions, such as memory and reasoning, which are not fully understood. “Evolution has spent four billion years developing vision,” Poggio says.
Scientists have traditionally used statistical learning systems to teach computers to recognize objects. In such systems, a scientist tells a machine that certain images are faces, then tells it that certain other images are not faces. The computer examines the images pixel by pixel to figure out, statistically, what the face images have in common that the nonface images do not.
For instance, it might learn that a set of pixels representing the brow is a brighter than the pixels representing the pupils, and that the two sets are a standard distance apart. It might notice that the mouth tends to be horizontal, and that there is a sharp change in brightness where the head stops and the background begins. Once trained, it can look at new images and see how closely they match the rules.