How Computers Can Tell What They’re Looking At

Images from inside an artificial neural network help explain why a technique called deep learning is enabling software to see.

Apr 11, 2016

Software has lately become much, much better at understanding images. Last year Microsoft and Google showed off systems more accurate than humans at recognizing objects in photos, as judged by the standard benchmark researchers use.

That became possible thanks to a technique called deep learning, which involves passing data through networks of roughly simulated neurons to train them to filter future data (see “Teaching Machines to Understand Us”). Deep learning is why you can search images stored in Google Photos using keywords, and why Facebook recognizes your friends in photos before you’ve tagged them. Using deep learning on images is also making robots and self-driving cars more practical, and it could revolutionize medicine.

That power and flexibility come from the way an artificial neural network can figure out which visual features to look for in images when provided with lots of labeled example photos. The neural networks used in deep learning are arranged into a hierarchy of layers that data passes through in sequence. During the training process, different layers in the network become specialized to identify different types of visual features. The type of neural network used on images, known as a convolutional net, was inspired by studies on the visual cortex of animals.
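To make the idea of a filter that looks at small sections of an image concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the network described in this article: the tiny image and the `vertical_edge` filter values are hand-picked for the example, whereas a trained convolutional net learns its filter values from labeled photos.

```python
# Illustrative sketch only: the image and filter values are made up.
# A real network learns its filters during training.

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) with no padding
    and a step of one pixel, returning one response per small patch.
    (Strictly this is cross-correlation, which is what deep-learning
    libraries usually mean by "convolution".)"""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, value) for value in row] for row in feature_map]


# A 5x5 grayscale image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical filter that responds where brightness rises from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

response = relu(convolve2d(image, vertical_edge))
for row in response:
    print(row)
```

The response is strongest over the patches that contain the dark-to-bright boundary and zero over the uniformly bright region, which is the sense in which a filter "detects" an edge.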

The first layers of the neural network look at only small sections of an image at a time and become specialized to detect very simple patterns. Simple patterns of color and shading at this level are enough to spot colors and edges.

The next layer in the network uses the filters from the first one like building blocks to look at larger chunks of an image and detect more complex features. It responds to patterns such as corners, stripes, and meshes.

The third layer in the network shows how building up features like those detected by earlier layers makes it possible to distinguish parts of objects. A round orange shape, like a tomato, can be differentiated from a white furry one, like a rabbit.

The most sophisticated image-processing layers in the network respond to very complex collections of features that correspond to high-level concepts, such as unicycles, dogs, flowers, and people. This is the fifth of eight layers in the network. Its output is translated by the remaining three layers into numerical scores indicating how the network has identified the image it just looked at.
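The article does not say how the final layers compute their numerical scores. One common choice, assumed here purely for illustration, is a softmax function that turns raw per-category outputs into values that sum to 1. The labels and raw numbers below are hypothetical.

```python
import math


def softmax(raw_scores):
    """Convert raw outputs into positive scores that sum to 1."""
    peak = max(raw_scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(score - peak) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical categories and made-up raw outputs from a final layer.
labels = ["unicycle", "dog", "flower"]
raw_scores = [1.0, 3.0, 0.5]

probabilities = softmax(raw_scores)
best_label = labels[probabilities.index(max(probabilities))]

for label, probability in zip(labels, probabilities):
    print(f"{label}: {probability:.3f}")
print("top guess:", best_label)
```

The category with the largest raw output receives the highest score, so the network's "answer" is simply the label with the top value.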

“These networks are a huge leap over traditional computer vision methods, since they learn directly from the data they are fed,” says Matthew Zeiler, CEO of Clarifai, which offers an image recognition service used by companies including BuzzFeed to organize and search photos and video. Programmers used to have to design by hand the mathematical routines that software used to look for visual features, and the results weren’t good enough to build many useful products.

Zeiler developed a way to visualize the workings of neural networks as a grad student working with Rob Fergus at NYU. The images in the slideshow above take you inside a deep-learning network trained with 1.3 million photos for the standard image recognition test on which systems from Microsoft and others can now beat humans. The test asks software to spot 1,000 different objects as diverse as mosquito nets and mosques. Each image shows the visual features that most strongly activate neurons in one layer of the network.