Will Artificial Intelligence Win the Caption Contest?

Neural networks have mastered the ability to label things in images, and now they’re learning to tell stories from a set of photos.

When social-media users upload photographs and caption them, they don’t just label their contents. They tell a story, which gives the photos context and additional emotional meaning.

A paper published by Microsoft Research describes an image captioning system that mimics humans’ unique style of visual storytelling. Companies like Microsoft, Google, and Facebook have spent years teaching computers to label the contents of images, but this new research takes it a step further by teaching a neural-network-based system to infer a story from several images. Someday it could be used to automatically generate descriptions for sets of images, or to bring humanlike language to other applications for artificial intelligence.

“Rather than giving bland or vanilla descriptions of what’s happening in the images, we put those into a larger narrative context,” says Frank Ferraro, a Johns Hopkins University PhD student who coauthored the paper. “You can start making likely inferences of what might be happening.”

Consider an album of pictures depicting a group of friends celebrating a birthday at a bar. Some of the early pictures show people ordering beer and drinking it, while a later photo shows someone asleep on a couch.

“A captioning system might just say, ‘A person lying on a couch,’” Ferraro says. “But a storytelling system might be able to say, ‘Well, given that I think these people were out partying or out eating and drinking, then this person may be drunk.’”

One example listed in the paper includes a series of five images. They show a family gathered around a table, a plate of shellfish, a dog, and images from the beach. The neural network described them with a story reading, “The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.”

Slideshow captions: “The dog was ready to go. He had a great time on the hike. And was very happy to be in the field. His mom was so proud of him. It was a beautiful day for him.”

The team, which was led by Microsoft researcher Margaret Mitchell and included Microsoft interns like Ferraro and a researcher from Facebook AI, turned what’s called a sequence-to-sequence recurrent neural network into a storyteller by training it with images sourced from Flickr. They had helpers write captions for individual images and for series of images in specific sequences.

When the team tried an approach similar to those used to label the contents of single photos, the stories it produced were too generic. To counter this, the team developed a way for the network to choose words that were likely to be visually salient. They also required that the system not repeat words.
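To make those two constraints concrete, here is a minimal sketch in Python. It is not the method from the paper, whose details this article does not give. It assumes the network's output can be reduced to a score for each candidate word at each step, and it uses a greedy picker that adds a salience bonus and skips content words already used. The function name `pick_words`, the `FUNCTION_WORDS` list, the `salience` table, and all of the numbers are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical list of function words that are allowed to repeat.
FUNCTION_WORDS: Set[str] = {"the", "a", "was", "to", "on", "in", "of", "and"}


def pick_words(
    step_scores: List[Dict[str, float]],
    salience: Dict[str, float],
    salience_weight: float = 1.0,
) -> List[str]:
    """Greedily pick one word per step.

    Each candidate's score gets a bonus proportional to its (assumed)
    visual salience, and a content word that was already emitted is
    never chosen again.
    """
    used: Set[str] = set()
    story: List[str] = []
    for scores in step_scores:
        best_word: Optional[str] = None
        best_score = float("-inf")
        for word, score in scores.items():
            # No-repeat rule: skip content words that already appeared.
            if word in used and word not in FUNCTION_WORDS:
                continue
            total = score + salience_weight * salience.get(word, 0.0)
            if total > best_score:
                best_word, best_score = word, total
        if best_word is None:
            break  # every candidate at this step was ruled out
        story.append(best_word)
        used.add(best_word)
    return story


# Toy scores standing in for a model's per-step output (made-up values).
steps = [
    {"the": 0.9, "a": 0.4},
    {"dog": 0.4, "day": 0.6},
    {"was": 0.8, "dog": 0.9},
    {"happy": 0.5, "dog": 0.9, "good": 0.6},
]
salience = {"dog": 0.5, "happy": 0.3}

print(" ".join(pick_words(steps, salience)))
# → the dog was happy
```

With `salience_weight=0.0`, the same toy scores yield "the day dog good". In this sketch, the salience bonus is what steers the picker toward the word tied to the image, and the no-repeat rule keeps "dog" from being chosen twice.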

Storytelling is an important part of being human, says Stanford Vision Lab director Fei-Fei Li, who did not contribute to the research. Technology that can imitate humans’ techniques for documenting stories needs to be able to cross-reference objects and characters seen in multiple pictures and infer relationships between people, objects, and places.

“The published paper is just the beginning toward this kind of technology,” Li says. “But it is a good step forward to start tackling such an ambitious project. I look forward to more follow-up work from these authors and others.”
