Google’s Brain-Inspired Software Describes What It Sees in Complex Images

Experimental Google software that can describe a complex scene could lead to better image search or apps to help the visually impaired.

Tom Simonitearchive page

November 18, 2014

Researchers at Google have created software that can use complete sentences to accurately describe scenes shown in photos—a significant advance in the field of computer vision. When shown a photo of a game of ultimate Frisbee, for example, the software responded with the description “A group of young people playing a game of frisbee.” The software can even count, giving answers such as “Two pizzas sitting on top of a stove top oven.”

Experimental software from Google can accurately describe scenes in photos, like the two on the left. But it still makes mistakes, as seen with the two photos on the right.

Previously, most efforts to create software that understands images have focused on the easier task of identifying single objects.

“It’s very exciting,” says Oriol Vinyals, a research scientist at Google. “I’m sure there are going to be some potential applications coming out of this.”

The new software is the latest product of Google’s research into using large collections of simulated neurons to process data (see “10 Breakthrough Technologies 2013: Deep Learning”). No one at Google programmed the new software with rules for how to interpret scenes. Instead, its networks “learned” by consuming data. Though it’s just a research project for now, Vinyals says, he and others at Google have already begun to think about how it could be used to enhance image search or help the visually impaired navigate online or in the real world.

Google’s researchers created the software through a kind of digital brain surgery, plugging together two neural networks developed separately for different tasks. One network had been trained to process images into a mathematical representation of their contents, in preparation for identifying objects. The other had been trained to generate full English sentences as part of automated translation software.

When the networks are combined, the first can “look” at an image and then feed the mathematical description of what it “sees” into the second, which uses that information to generate a human-readable sentence. The combined network was trained to generate more accurate descriptions by showing it tens of thousands of images with descriptions written by humans. “We’re seeing through language what it thought the image was,” says Vinyals.

After that training process, the software was set loose on several large data sets of images from Flickr and other sources and asked to describe them. The accuracy of its descriptions was then judged with an automated test used to benchmark computer-vision software. Google’s software posted scores in the 60s on a 100-point scale. Humans doing the test typically score in 70s, says Vinyals.

That result suggests Google is far ahead of other researchers working to create scene-describing software. Stanford researchers recently published details of their own system and reported that it scored between 40 and 50 on the same standard test.

However, Vinyals notes that researchers at Google and elsewhere are still in the early stages of understanding how to create and test this kind of software. When Google asked humans to rate its software’s descriptions of images on a scale of 1 to 4, it averaged only 2.5, suggesting that it still has a long way to go.

Vinyals predicts that research on understanding and describing scenes will now intensify. One problem that could slow things down: though large databases of hand-labeled images have been created to train software to recognize individual objects, there are fewer labeled photos of more natural scenes.

Microsoft this year launched a database called COCO to try to fix that. Google used COCO in its new research, but it is still relatively small. “I hope other parties will chip in and make it better,” says Vinyals.

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.