Intelligent Machines

The Next Big Step for AI? Understanding Video

Perceiving dynamic actions could be a huge advance in how software makes sense of the world.

A screenshot from one of the videos in the Moments in Time Dataset, which could help AI better understand video content.
Moments in Time Dataset

For a computer, recognizing a cat or a duck in a still image is pretty clever. But a stiffer test for artificial intelligence will be understanding when the cat is riding a Roomba and chasing the duck around a kitchen.

MIT and IBM this week released a vast data set of video clips painstakingly annotated with details of the action being carried out. The Moments in Time Dataset includes three-second snippets of everything from fishing to break-dancing.

“A lot of things in the world change from one second to the next,” says Aude Oliva, a principal research scientist at MIT and one of the people behind the project. “If you want to understand why something is happening, motion gives you a lot of information that you cannot capture in a single frame.”

The current boom in artificial intelligence was sparked, in part, by success in teaching computers to recognize the contents of static images by training deep neural networks on large labeled data sets (see “The Revolutionary Technique That Quietly Changed Machine Vision Forever”).

AI systems that interpret video today, including the systems found in some self-driving cars, often rely on identifying objects in static frames rather than interpreting actions. On Monday Google launched a tool capable of recognizing the objects in video as part of its Cloud Platform, a service that already includes AI tools for processing image, audio, and text.

This shows the areas of the video frames that a neural network is focusing on in order to recognize the event in the video.

The next challenge may be teaching machines to understand not just what a video contains, but what’s happening in the footage as well. That could have some practical benefits, perhaps leading to powerful new ways of searching, annotating, and mining video footage. It also figures to give robots or self-driving cars a better understanding of how the world around them is unfolding.

The MIT-IBM data set is in fact just one of several video data sets designed to spur progress in training machines to understand actions in the physical world. Last year, for example, Google released a set of eight million tagged YouTube videos called YouTube-8M. Facebook is developing an annotated data set of video actions called the Scenes, Actions, and Objects set.

Olga Russakovsky, an assistant professor at Princeton University who specializes in computer vision, says it has proved difficult to develop useful video data sets because they require more storage and computing power than still images do. “I’m excited to play with this new data,” she says. “I think the three-second length is great—it provides temporal context while keeping the storage and computation requirements low.”

Others are taking a more creative approach. Twenty Billion Neurons, a startup based in Toronto and Berlin, created a custom data set by paying crowdsourced workers to perform simple tasks. One of the company’s cofounders, Roland Memisevic, says it also uses a neural network designed specifically to process temporal vision information.

“Networks trained on the other data sets can tell you whether the video shows a soccer match or a party,” he says. “Our networks can tell you whether someone just entered the room.”

Danny Gutfreund, a researcher at IBM who collaborated on the project, says recognizing actions effectively will require that machines learn about, say, a person taking an action and transfer this knowledge to a case where, say, an animal is performing the same action. Progress in this area, known as transfer learning, will be important for the future of AI. “Let’s see how machines can do this transfer learning, this analogy, that we do very well,” he says.

Gutfreund adds that the technology could have practical applications. “You could use it for elder care, telling if someone has fallen or if they have taken their medicine,” he says. “You can think of devices that help blind people.”
