Google’s Brain-Inspired Software Describes What It Sees in Complex Images

Experimental Google software that can describe a complex scene could lead to better image search or apps to help the visually impaired.
November 18, 2014

Researchers at Google have created software that can use complete sentences to accurately describe scenes shown in photos—a significant advance in the field of computer vision. When shown a photo of a game of ultimate Frisbee, for example, the software responded with the description “A group of young people playing a game of frisbee.” The software can even count, giving answers such as “Two pizzas sitting on top of a stove top oven.”

Experimental software from Google can accurately describe scenes in photos, like the two on the left. But it still makes mistakes, as seen with the two photos on the right.

Previously, most efforts to create software that understands images had focused on the easier task of identifying single objects.

“It’s very exciting,” says Oriol Vinyals, a research scientist at Google. “I’m sure there are going to be some potential applications coming out of this.”

The new software is the latest product of Google’s research into using large collections of simulated neurons to process data (see “10 Breakthrough Technologies 2013: Deep Learning”). No one at Google programmed the new software with rules for how to interpret scenes. Instead, its networks “learned” by consuming data. Though it’s just a research project for now, Vinyals says, he and others at Google have already begun to think about how it could be used to enhance image search or help the visually impaired navigate online or in the real world.

Google’s researchers created the software through a kind of digital brain surgery, plugging together two neural networks developed separately for different tasks. One network had been trained to process images into a mathematical representation of their contents, in preparation for identifying objects. The other had been trained to generate full English sentences as part of automated translation software.

When the networks are combined, the first can “look” at an image and then feed the mathematical description of what it “sees” into the second, which uses that information to generate a human-readable sentence. The combined network was trained to generate more accurate descriptions by showing it tens of thousands of images with descriptions written by humans. “We’re seeing through language what it thought the image was,” says Vinyals.
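The pipeline described above can be sketched in miniature. This is not Google's code: the real system pairs a deep convolutional vision network with a recurrent language model, while the two functions below are hypothetical stand-ins that only illustrate the data flow, with an image reduced to a fixed-length feature vector that then conditions word-by-word caption generation.

```python
# Toy sketch of the encoder-decoder idea: one "network" turns an image
# into a fixed-length vector, a second turns that vector into a sentence.
# Both functions here are illustrative stand-ins, not real neural networks.

from typing import List


def encode_image(pixels: List[float]) -> List[float]:
    """Stand-in for the vision network: compress an image into a
    fixed-length feature vector (here, crude summary statistics)."""
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]


def decode_caption(features: List[float], max_words: int = 8) -> str:
    """Stand-in for the language network: emit words one at a time, each
    choice conditioned on the image features (and, in the real model,
    on the words generated so far)."""
    mean, spread = features
    words = ["a"]
    words.append("busy" if spread > 0.5 else "plain")
    words.append("scene" if mean > 0.5 else "photo")
    return " ".join(words[:max_words])


caption = decode_caption(encode_image([0.9, 0.2, 0.8, 0.1]))
print(caption)
```

In the real system, training on tens of thousands of human-captioned images tunes both halves jointly, so the feature vector comes to encode exactly the information the language model needs.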

After that training process, the software was set loose on several large data sets of images from Flickr and other sources and asked to describe them. The accuracy of its descriptions was then judged with an automated test used to benchmark computer-vision software. Google's software posted scores in the 60s on a 100-point scale. Humans doing the test typically score in the 70s, says Vinyals.
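The article does not name the automated test, but widely used caption benchmarks such as BLEU work by comparing the word n-grams in the software's caption against human-written reference captions. A minimal sketch of the simplest ingredient, clipped unigram precision:

```python
# Clipped unigram precision: the fraction of candidate words that also
# appear in the reference, with repeats credited at most as often as
# they occur in the reference (the "clipped" counting BLEU uses).

from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / sum(cand.values())


score = unigram_precision(
    "a group of people playing frisbee",
    "a group of young people playing a game of frisbee",
)
print(score)
```

Full BLEU extends this to longer n-grams and penalizes overly short captions, but even the complete metric only measures overlap with references, which is one reason the human ratings Vinyals cites below paint a less flattering picture than the automated score.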

That result suggests Google is far ahead of other researchers working to create scene-describing software. Stanford researchers recently published details of their own system and reported that it scored between 40 and 50 on the same standard test.

However, Vinyals notes that researchers at Google and elsewhere are still in the early stages of understanding how to create and test this kind of software. When Google asked humans to rate its software’s descriptions of images on a scale of 1 to 4, it averaged only 2.5, suggesting that it still has a long way to go.

Vinyals predicts that research on understanding and describing scenes will now intensify. One problem that could slow things down: though large databases of hand-labeled images have been created to train software to recognize individual objects, there are fewer labeled photos of more natural scenes.

Microsoft this year launched a database called COCO to try to fix that. Google used COCO in its new research, but it is still relatively small. “I hope other parties will chip in and make it better,” says Vinyals. 
