Skip to Content

Google’s Brain-Inspired Software Describes What It Sees in Complex Images

Experimental Google software that can describe a complex scene could lead to better image search or apps to help the visually impaired.
November 18, 2014

Researchers at Google have created software that can use complete sentences to accurately describe scenes shown in photos—a significant advance in the field of computer vision. When shown a photo of a game of ultimate Frisbee, for example, the software responded with the description “A group of young people playing a game of frisbee.” The software can even count, giving answers such as “Two pizzas sitting on top of a stove top oven.”

Experimental software from Google can accurately describe scenes in photos, like the two on the left. But it still makes mistakes, as seen with the two photos on the right.

Previously, most efforts to create software that understands images have focused on the easier task of identifying single objects.

“It’s very exciting,” says Oriol Vinyals, a research scientist at Google. “I’m sure there are going to be some potential applications coming out of this.”

The new software is the latest product of Google’s research into using large collections of simulated neurons to process data (see “10 Breakthrough Technologies 2013: Deep Learning”). No one at Google programmed the new software with rules for how to interpret scenes. Instead, its networks “learned” by consuming data. Though it’s just a research project for now, Vinyals says, he and others at Google have already begun to think about how it could be used to enhance image search or help the visually impaired navigate online or in the real world.

Google’s researchers created the software through a kind of digital brain surgery, plugging together two neural networks developed separately for different tasks. One network had been trained to process images into a mathematical representation of their contents, in preparation for identifying objects. The other had been trained to generate full English sentences as part of automated translation software.

When the networks are combined, the first can “look” at an image and then feed the mathematical description of what it “sees” into the second, which uses that information to generate a human-readable sentence. The combined network was trained to generate more accurate descriptions by showing it tens of thousands of images with descriptions written by humans. “We’re seeing through language what it thought the image was,” says Vinyals.

After that training process, the software was set loose on several large data sets of images from Flickr and other sources and asked to describe them. The accuracy of its descriptions was then judged with an automated test used to benchmark computer-vision software. Google’s software posted scores in the 60s on a 100-point scale. Humans doing the test typically score in 70s, says Vinyals.

That result suggests Google is far ahead of other researchers working to create scene-describing software. Stanford researchers recently published details of their own system and reported that it scored between 40 and 50 on the same standard test.

However, Vinyals notes that researchers at Google and elsewhere are still in the early stages of understanding how to create and test this kind of software. When Google asked humans to rate its software’s descriptions of images on a scale of 1 to 4, it averaged only 2.5, suggesting that it still has a long way to go.

Vinyals predicts that research on understanding and describing scenes will now intensify. One problem that could slow things down: though large databases of hand-labeled images have been created to train software to recognize individual objects, there are fewer labeled photos of more natural scenes.

Microsoft this year launched a database called COCO to try to fix that. Google used COCO in its new research, but it is still relatively small. “I hope other parties will chip in and make it better,” says Vinyals. 

Keep Reading

Most Popular

A Roomba recorded a woman on the toilet. How did screenshots end up on Facebook?

Robot vacuum companies say your images are safe, but a sprawling global supply chain for data from our devices creates risk.

A startup says it’s begun releasing particles into the atmosphere, in an effort to tweak the climate

Make Sunsets is already attempting to earn revenue for geoengineering, a move likely to provoke widespread criticism.

10 Breakthrough Technologies 2023

Every year, we pick the 10 technologies that matter the most right now. We look for advances that will have a big impact on our lives and break down why they matter.

These exclusive satellite images show that Saudi Arabia’s sci-fi megacity is well underway

Weirdly, any recent work on The Line doesn’t show up on Google Maps. But we got the images anyway.

Stay connected

Illustration by Rose Wong

Get the latest updates from
MIT Technology Review

Discover special offers, top stories, upcoming events, and more.

Thank you for submitting your email!

Explore more newsletters

It looks like something went wrong.

We’re having trouble saving your preferences. Try refreshing this page and updating them one more time. If you continue to get this message, reach out to us at with a list of newsletters you’d like to receive.