
AI Is Learning to See the World—But Not the Way Humans Do

AI systems are modeled after human biology, but their vision systems still work quite differently.

Computer vision has been having a moment. Image recognition algorithms no longer make dumb mistakes when looking at the world: these days, they can accurately tell you that an image contains a cat. But the way they pull off that party trick may be less human than we thought.

Most computer vision systems identify features in images using neural networks, which are inspired by our own biology and loosely similar in architecture, only with biological sensing and neurons swapped out for mathematical functions. Now a study by researchers at Facebook and Virginia Tech says that despite those similarities, we should be careful about assuming the two work in the same way.

To see exactly what was happening as both humans and AI analyzed an image, the researchers studied where the two focused their attention. Both were provided with blurred images and asked questions about what was happening in the picture—“Where is the cat?” for instance. Parts of the image could be selectively sharpened, one at a time, and both human and AI did so until they could answer the question. The team repeated the tests using several different algorithms.

Both could provide answers, of course; the interesting result is how they arrived at them. On a scale from -1 to 1, where 1 means total agreement and -1 total disagreement, two humans scored 0.63 on average in terms of where they focused their attention across an image. A human paired with an AI averaged just 0.26.

In other words: the AI and human were both looking at the same image, both being asked the same question, both getting it right—but using different visual features to arrive at those same conclusions.
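The article doesn't name the scoring method, but a score that runs from -1 to 1 behaves like a rank correlation between two attention maps. Below is a minimal sketch of how such an agreement score could be computed; the Spearman correlation, the 14x14 map size, and the maps themselves are illustrative assumptions, not details from the study.

```python
# Minimal sketch: comparing where two "viewers" (human or AI) focus attention
# on the same image. The metric is assumed to be a rank correlation over
# flattened attention maps; the maps below are random placeholders.
import numpy as np
from scipy.stats import spearmanr

def attention_agreement(map_a: np.ndarray, map_b: np.ndarray) -> float:
    """Return a score in [-1, 1]: 1 = identical ranking of image regions,
    -1 = exactly opposite ranking."""
    rho, _ = spearmanr(map_a.ravel(), map_b.ravel())
    return float(rho)

rng = np.random.default_rng(0)
human_map = rng.random((14, 14))                         # hypothetical human attention map
ai_map = 0.5 * human_map + 0.5 * rng.random((14, 14))    # partially overlapping AI map

print(f"agreement: {attention_agreement(human_map, ai_map):.2f}")
```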

The result makes explicit a phenomenon that researchers had already hinted at. In 2014, a team from Cornell University and the University of Wyoming showed that it was possible to fool AI into seeing something simply by creating a picture made up of the strong visual features that the software had come to associate with an object. Humans draw on a large pool of common-sense knowledge, so they don't get caught out by such tricks. That is something researchers are now trying to build into a new breed of intelligent software that understands the semantic visual world.

But just because computers don’t use the same approach doesn’t necessarily mean they’re inferior. In fact, they may be better off ignoring the human approach altogether.

The kinds of neural networks used in computer vision usually employ a technique known as supervised learning to work out what’s happening in an image. Ultimately, their ability to associate a complex combination of patterns, textures, and shapes with the name of an object is made possible by providing the AI with a training set of images whose contents have already been labeled by a human.
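As a rough illustration of what supervised learning looks like in code, here is a minimal sketch assuming a PyTorch setup. The tiny network, the random stand-in images, and the ten categories are placeholders; a real system would train a much larger model on millions of human-labeled photographs.

```python
# A minimal sketch of supervised image recognition, assuming PyTorch.
# The model, "images," and labels are all placeholders for illustration.
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),                 # 10 object categories, e.g. "cat", "dog", ...
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a batch of human-labeled training images.
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 10, (8,))

for step in range(5):                  # in practice: many passes over a huge labeled dataset
    logits = model(images)             # predicted scores for each category
    loss = loss_fn(logits, labels)     # penalty for disagreeing with the human labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```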

But teams at Facebook and Google’s DeepMind have been experimenting with unsupervised learning systems that ingest content from video and images to learn what human faces and everyday objects look like, without any human intervention. Magic Pony, recently bought by Twitter, also shuns supervised learning, instead learning to recognize statistical patterns in images to teach itself what edges, textures, and other features should look like.
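The article doesn't describe these systems' exact designs, but one generic way to learn visual features without labels is an autoencoder: a network trained only to reconstruct its own input, so the only "supervision" comes from the pixels themselves. The sketch below is a hypothetical illustration of that idea, not the approach used by Facebook, DeepMind, or Magic Pony.

```python
# A generic, hypothetical illustration of unsupervised feature learning:
# an autoencoder trained only to reconstruct its input, with no human labels.
import torch
from torch import nn

autoencoder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # encoder: compress the image
    nn.ConvTranspose2d(8, 3, kernel_size=4, stride=2, padding=1),     # decoder: rebuild the pixels
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

images = torch.randn(8, 3, 64, 64)     # unlabeled images (random placeholders)
for step in range(5):
    reconstruction = autoencoder(images)
    loss = nn.functional.mse_loss(reconstruction, images)  # learn to reproduce the input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```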

In these cases, it’s perhaps even less likely that the AI’s knowledge will be generated through a process that apes a human’s. Once inspired by human brains, AI may beat us by simply being itself.

(Read more: New Scientist, “The Missing Link of Artificial Intelligence”, “‘Smart’ Software Can Be Tricked into Seeing What Isn’t There”)
