How Computers Can Tell What They’re Looking At

Images from inside an artificial neural network help explain why a technique called deep learning is enabling software to see.

Tom Simonite

April 11, 2016

Software has lately become much, much better at understanding images. Last year Microsoft and Google showed off systems more accurate than humans at recognizing objects in photos, as judged by the standard benchmark researchers use.

That became possible thanks to a technique called deep learning, which involves passing data through networks of roughly simulated neurons to train them to filter future data (see “Teaching Machines to Understand Us”). Deep learning is why you can search images stored in Google Photos using keywords, and why Facebook recognizes your friends in photos before you’ve tagged them. Using deep learning on images is also making robots and self-driving cars more practical, and it could revolutionize medicine.

That power and flexibility come from the way an artificial neural network can figure out which visual features to look for in images when provided with lots of labeled example photos. The neural networks used in deep learning are arranged into a hierarchy of layers that data passes through in sequence. During the training process, different layers in the network become specialized to identify different types of visual features. The type of neural network used on images, known as a convolutional net, was inspired by studies on the visual cortex of animals.
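The layered filtering described above can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the 4x4 image, the names `conv2d`, `relu` and `layer`, and especially the filter values, which are picked by hand. In a real convolutional net the filter values are learned from labeled examples rather than written by a programmer.

```python
def conv2d(image, kernel):
    """Slide a small kernel over a 2D image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def layer(image, kernel):
    """One simplified layer: a convolution followed by ReLU."""
    return relu(conv2d(image, kernel))


# Hypothetical hand-picked filters (a trained network would learn these).
VERTICAL_EDGE = [[-1.0, 1.0],
                 [-1.0, 1.0]]      # responds to dark-to-bright steps, left to right
TALL_STRUCTURE = [[1.0],
                  [1.0],
                  [1.0]]           # responds when an edge continues down three rows

# Toy 4x4 image: left half dark (0), right half bright (1).
image = [[0, 0, 1, 1] for _ in range(4)]

layer1 = layer(image, VERTICAL_EDGE)    # low-level feature: local edges
layer2 = layer(layer1, TALL_STRUCTURE)  # higher-level feature built from edges

print(layer1)
print(layer2)
```

The first layer responds only where the image steps from dark to bright, and the second layer combines those responses into a single, larger feature. That is the sense in which later layers specialize in more complex visual features than earlier ones.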

“These networks are a huge leap over traditional computer vision methods, since they learn directly from the data they are fed,” says Matthew Zeiler, CEO of Clarifai, which offers an image recognition service used by companies including BuzzFeed to organize and search photos and video. Previously, programmers had to design by hand the mathematical routines that software used to look for visual features, and the results weren’t good enough to build many useful products.

Zeiler developed a way to visualize the workings of neural networks as a grad student working with Rob Fergus at NYU. The images in the slideshow above take you inside a deep-learning network trained with 1.3 million photos for the standard image recognition test on which systems from Microsoft and others can now beat humans. It asks software to spot 1,000 different objects as diverse as mosquito nets and mosques. Each image shows visual features that most strongly activate neurons in one layer of the network.
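One simple way to get a feel for what “most strongly activates” means is to rank patches of an input image by how strongly a single filter responds to each of them. The sketch below does only that. It is a much simpler stand-in, not the visualization method Zeiler developed, and the filter, the toy image and the function names `patch_response` and `top_patches` are all hypothetical.

```python
def patch_response(patch, kernel):
    """Activation of one simulated neuron: weighted sum of the patch, floored at 0."""
    total = sum(p * k
                for patch_row, kernel_row in zip(patch, kernel)
                for p, k in zip(patch_row, kernel_row))
    return max(0.0, total)


def top_patches(image, kernel, k=1):
    """Return the k patches that activate the filter most, as (score, (row, col), patch)."""
    kh, kw = len(kernel), len(kernel[0])
    scored = []
    for r in range(len(image) - kh + 1):
        for c in range(len(image[0]) - kw + 1):
            patch = [row[c:c + kw] for row in image[r:r + kh]]
            scored.append((patch_response(patch, kernel), (r, c), patch))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical filter that prefers a dark-to-bright vertical edge.
VERTICAL_EDGE = [[-1.0, 1.0],
                 [-1.0, 1.0]]

# Toy image with a bright block in the lower right corner.
image = [[0, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

best_score, best_position, best_patch = top_patches(image, VERTICAL_EDGE)[0]
print(best_score, best_position, best_patch)
```

The winning patch is the one that looks most like the filter itself, here a dark column next to a bright one. Collecting such top-scoring patches for many neurons gives a rough picture of what each one has specialized in.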
