Computer vision has been having a moment. Image recognition algorithms no longer routinely make dumb mistakes when looking at the world: these days, they can accurately tell you that an image contains a cat. But the way they pull off that party trick may not be as familiar to humans as we thought.
Most computer vision systems identify features in images using neural networks, which are inspired by our own biology and are similar to it in architecture, except that the biological sensing and neurons are swapped out for mathematical functions. Now a study by researchers at Facebook and Virginia Tech says that despite those similarities, we should be careful about assuming that the two work in the same way.
To see exactly what was happening as both humans and AI analyzed an image, the researchers studied where the two focused their attention. Both were provided with blurred images and asked questions about what was happening in the picture—“Where is the cat?” for instance. Parts of the image could be selectively sharpened, one at a time, and both human and AI did so until they could answer the question. The team repeated the tests using several different algorithms.
Both could provide answers, of course, but the interesting result is how they did so. On a scale of 1 to -1, where 1 is total agreement and -1 total disagreement, two humans scored 0.63 on average for where they focused their attention across the image. For a human and an AI, the average dropped to 0.26.
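To make that scale concrete, here is a minimal sketch of one way such an agreement score could be computed. The article does not name the metric, so this assumes a rank correlation between two attention maps, where each map records how much attention each region of the image received. The function name `attention_agreement` and the tiny 2×2 maps are invented for illustration, and tied values are ranked naively.

```python
from math import sqrt


def _ranks(values):
    """Return the rank (0 = smallest) of each value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def _pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def attention_agreement(map_a, map_b):
    """Rank correlation between two attention maps (assumed metric, not the study's code).

    Each map is a list of rows; higher numbers mean more attention on that region.
    The result runs from 1 (same ordering of regions) to -1 (opposite ordering).
    """
    flat_a = [value for row in map_a for value in row]
    flat_b = [value for row in map_b for value in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("attention maps must cover the same number of regions")
    return _pearson(_ranks(flat_a), _ranks(flat_b))


# Hypothetical attention maps for one image, split into four regions.
human_map = [[0.9, 0.1], [0.3, 0.5]]
model_map = [[0.2, 0.8], [0.4, 0.1]]
print(attention_agreement(human_map, model_map))
```

Two maps that rank the regions identically score 1 and two that rank them in exactly opposite order score -1, so averages of 0.63 and 0.26 would both sit on the agreeing side of zero, with the human pairs much closer to full agreement.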
In other words: the AI and human were both looking at the same image, both being asked the same question, both getting it right—but using different visual features to arrive at those same conclusions.
This is an explicit result about a phenomenon that researchers had already hinted at. In 2014, a team from Cornell University and the University of Wyoming showed that it was possible to create images that fool AI into seeing something, simply by creating a picture made up of the strong visual features that the software had come to associate with an object. Humans have a large pool of common-sense knowledge to draw on, which means they don’t get caught out by such tricks. Researchers are trying to build that kind of knowledge into a new breed of intelligent software that understands the visual world semantically.
But just because computers don’t use the same approach doesn’t necessarily mean they’re inferior. In fact, they may be better off ignoring the human approach altogether.
The kinds of neural networks used in computer vision usually employ a technique known as supervised learning to work out what’s happening in an image. Ultimately, their ability to associate a complex combination of patterns, textures, and shapes with the name of an object is made possible by providing the AI with a training set of images whose contents have already been labeled by a human.
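The idea of learning from human-labeled examples can be sketched in a few lines. This is not how a neural network works internally: it swaps in a much simpler nearest-centroid classifier as a stand-in, and the two-number "features", the labels, and the function names are all invented for illustration. What it shares with the real thing is the supervised setup, in which the program only knows what a "cat" is because a human attached that label to the training examples.

```python
from math import dist


def fit_centroids(features, labels):
    """Average the feature vectors that share a human-provided label."""
    grouped = {}
    for vector, label in zip(features, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose average training example is closest to the input."""
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


# Hypothetical training set: each image is reduced to two made-up feature scores,
# and a human has already said what each image contains.
training_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
training_labels = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(training_features, training_labels)
print(predict(centroids, [0.2, 0.4]))
```

A real vision network learns its own features from pixels rather than being handed two tidy numbers per image, but it depends on the labeled training set in the same way.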
But teams at Facebook and Google’s DeepMind have been experimenting with unsupervised learning systems that ingest content from video and images to learn what human faces and everyday objects look like, without any human intervention. Magic Pony, recently bought by Twitter, also shuns supervised learning: its software picks up on statistical patterns in images to teach itself what edges, textures, and other features should look like.
In these cases, it’s perhaps even less likely that the knowledge of the AI will be generated through a process aping that of a human. Once inspired by human brains, AI may beat us by simply being itself.