Neuroscientists at MIT have developed a computer model that mimics the human vision system to accurately detect and recognize objects in a busy street scene, such as cars and motorcycles.
Such biologically inspired vision systems could soon be used in surveillance systems, or in smart sensors that can warn drivers of pedestrians and other obstacles. They may also help in the development of so-called visual search engines, says Thomas Serre, a neuroscientist at the Center for Biological and Computational Learning at MIT’s McGovern Institute for Brain Research, who was involved in the project.
Researchers have been interested for years in trying to copy biological vision systems, simply because they are so good, says David Hogg, a computer vision expert at Leeds University in the UK. “This is a very successful example of [mimicking biological vision],” he says.
Teaching a computer to classify objects has proved much harder than was originally anticipated, says Serre, who carried out the work with Tomaso Poggio, codirector of the center. On the one hand, to recognize a particular type of object, such as a car, a computer needs a template or computational representation specific to that particular object. Such a template enables the computer to distinguish a car from objects in other classes (noncars). Yet this representation must be sufficiently flexible to include all types of cars, no matter how varied in appearance, at different angles, positions, and poses, and under different lighting conditions.
“You want to be able to recognize an object anywhere in the field of vision, irrespective of where it is and irrespective of its size,” says Serre. Yet if you analyze images just by their patterns of light and dark pixels, then two portrait images of different people can end up looking more similar than two images of the same person taken from different angles.
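The pixel-comparison problem Serre describes can be illustrated with a toy example. The sketch below is not part of the MIT work; the tiny grids and the `pixel_distance` helper are assumptions made up for illustration. It compares images purely by their light and dark pixels and shows that a shape can look closer to a different shape in the same place than to a shifted copy of itself.

```python
# Toy illustration only: hypothetical 3x6 "images" of 0 (dark) and 1 (light).

def pixel_distance(a, b):
    """Count the pixels at which two equally sized images differ."""
    return sum(
        abs(pa - pb)
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )

# A cross on the left side of the image.
cross_left = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]

# The same cross, shifted to the right side.
cross_right = [
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
]

# A different shape (a filled square) in the cross's original position.
square_left = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]

same_shape_moved = pixel_distance(cross_left, cross_right)
different_shape_same_place = pixel_distance(cross_left, square_left)

# Raw pixels rate the square as the closer match to the left-hand cross.
print(same_shape_moved, different_shape_same_place)
```

Here the shifted cross differs from the original at more pixels than the square does, which is why a recognition system needs features that tolerate changes in position.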
The most effective method for getting around such problems is to train a learning algorithm on a set of images and allow it to extract the features they have in common; two wheels aligned with the road could signal a car, for example. Serre and Poggio believe that the human vision system uses a similar approach, but one that depends on a hierarchy of successive layers in the visual cortex. The first layers of the cortex detect an object’s simpler features, such as edges, and higher layers integrate that information to form our perception of the object as a whole.
To test their theory, Serre and Poggio worked with Stanley Bileschi, also at MIT, and Lior Wolf, a member of the computer science department at Tel Aviv University in Israel, to create a computer model comprising 10 million computational units, each designed to behave like a cluster of neurons in the visual cortex. Just as in the cortex, the units are organized into layers.
When the model first learns to “see,” some of the cell-like units extract rudimentary features from the scene, such as oriented edges, by analyzing very small groups of pixels. “These neurons are typically like pinholes that look at a small portion of the visual field,” says Serre. More-complex units are able to take in a larger portion of the image and recognize features regardless of their size or position. For example, if the simple units detect vertical and horizontal edges, a more complex unit could use that information to detect a corner.
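The division of labor between simple and more-complex units can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the researchers' model: the 2x2 edge kernels, the function names, and the choice of taking a maximum over all positions are all hypothetical choices made for this example.

```python
# Illustrative sketch only; kernels and function names are assumptions.

VERTICAL_EDGE = [[-1, 1], [-1, 1]]      # responds to a left/right contrast
HORIZONTAL_EDGE = [[-1, -1], [1, 1]]    # responds to a top/bottom contrast

def simple_units(image, kernel):
    """Each output value is one 'pinhole' unit looking at a small patch."""
    kh, kw = len(kernel), len(kernel[0])
    responses = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            total = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(abs(total))  # edge strength, ignoring polarity
        responses.append(row)
    return responses

def complex_unit(responses):
    """Pool simple units with a max: fires wherever the feature appears."""
    return max(max(row) for row in responses)

def describe(image):
    """Summarize an image by its pooled edge responses."""
    v = complex_unit(simple_units(image, VERTICAL_EDGE))
    h = complex_unit(simple_units(image, HORIZONTAL_EDGE))
    # Crude corner-like cue: both edge orientations are present somewhere.
    return {"vertical": v, "horizontal": h, "corner_like": min(v, h)}

def square_image(size, top, left, side):
    """Hypothetical helper: a dark image containing one bright square."""
    return [
        [1 if top <= r < top + side and left <= c < left + side else 0
         for c in range(size)]
        for r in range(size)
    ]

near_top_left = square_image(6, 1, 1, 2)
near_bottom_right = square_image(6, 3, 3, 2)

# The pooled description is identical even though the square has moved.
print(describe(near_top_left) == describe(near_bottom_right))
```

Because the pooling step keeps only the strongest response, the description of the square does not change when the square moves. This is a rough stand-in for the position tolerance described above; it does not address tolerance to size, and the corner-like cue only checks that both edge orientations occur somewhere in the image, not that they meet at one point.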
With each successive layer, increasingly complex features are extracted from the image. So are relationships between features, such as the distance between two parts of an object or the different angles at which the two parts are oriented. This information allows the system to recognize the same object at different angles.
“It was a surprise to us when we applied this model to real-world visual tasks and it competed well with the best systems,” says Serre. Indeed, in some tests their model successfully recognized objects more than 95 percent of the time, on average. The more images the system is trained on, the more accurately it performs.
“Maybe we shouldn’t be surprised,” says David Lowe, a computer vision and object recognition expert at the University of British Columbia in Vancouver. “Human vision is vastly better at recognition than any of our current computer systems, so any hints of how to proceed from biology are likely to be very useful.”
At the moment, the system is designed to analyze only still images. But this is very much in line with the way the human vision system works, says Serre. The inputs to the visual cortex are shared between one system that deals with shapes and textures and a separate system that deals with movement, he says. The team is now working on incorporating a parallel system to cope with video.