MIT Technology Review



The video-processing system also uses algorithms that can describe the movement of objects in successive frames. It generates sentences like “boat1 follows boat2 between 35:56 and 37:23” or “boat3 approaches maritime marker at 40:01.” “Sometimes it can do a match on an object that has left and reentered a scene,” says Zhu, “and say, for example, this is probably a certain car again.” It is also possible to define virtual “trip wires” to help it describe certain events, like a car running a stop sign (see video).
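To make the trip-wire idea concrete, here is a minimal sketch in Python. It is not I2T's actual method, which the article does not detail. It assumes an object tracker has already produced a list of timestamped positions for each object, and it treats a trip wire as a line segment drawn on the image. The names `crosses`, `tripwire_events`, `car_track`, and `stop_line` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Observation = Tuple[int, float, float]  # (time in seconds, x, y), assumed tracker output


def _orientation(a: Point, b: Point, c: Point) -> float:
    """Signed area test: the sign tells which side of the line a->b point c lies on."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def crosses(p: Point, q: Point, w1: Point, w2: Point) -> bool:
    """True if the step p->q strictly crosses the wire segment w1->w2."""
    d1 = _orientation(w1, w2, p)
    d2 = _orientation(w1, w2, q)
    d3 = _orientation(p, q, w1)
    d4 = _orientation(p, q, w2)
    # The endpoints of each segment must lie on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0


def format_time(seconds: int) -> str:
    """Render seconds as mm:ss, matching the timestamps quoted in the article."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def tripwire_events(
    name: str,
    track: List[Observation],
    wire_name: str,
    wire: Tuple[Point, Point],
) -> List[str]:
    """Emit one sentence for each step of the track that crosses the wire."""
    w1, w2 = wire
    events = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        if crosses((x0, y0), (x1, y1), w1, w2):
            events.append(f"{name} crosses {wire_name} at {format_time(t1)}")
    return events


# Hypothetical data: a car moving left to right past a vertical wire at x = 6.
car_track = [(2400, 0.0, 5.0), (2401, 4.0, 5.0), (2402, 8.0, 5.0)]
stop_line = ((6.0, 0.0), (6.0, 10.0))
print(tripwire_events("car1", car_track, "stop line", stop_line))
```

Detecting that a car actually ran a stop sign would also require checking that it never came to a halt before the line; this sketch only reports the crossing itself.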

Although the system demonstrates a step toward what Zhu calls a “grand vision in computer science,” I2T is not yet ready for commercialization. Processing surveillance footage is relatively easy for the software because the camera, and hence the background in a scene, is static; I2T is far from capable of recognizing the variety of objects or situations a human could. If set loose on random images or videos found online, for example, I2T would struggle to perform as well.

Improving the system’s knowledge of how to identify objects and scenes by adding to the number of images in the Lotus Hill Institute training set should help, says Zhu.

The I2T system underlying the surveillance prototype is powerful, says Zu Kim, a researcher at the University of California, Berkeley, who studies the use of computer vision to aid traffic surveillance and vehicle tracking. “It’s a really nice piece of work,” he says, even if it can’t come close to matching human performance.

Kim explains that better image parsing is relevant to artificial intelligence work of all kinds. “There are very many possibilities for a good image parser: for example, allowing a blind person to understand an image on the Web.”

Kim can see other uses for generating text from video, pointing out that it could be fed into a speech synthesizer. “It could be helpful if someone was driving and needed to know what a surveillance camera was seeing.” But humans are visual creatures, he adds, and in many situations could be expected to prefer to decide what’s happening in an image or a video for themselves.


Credit: Song-Chun Zhu/UCLA
Video by University of California, Los Angeles, and ObjectVideo of Reston, Virginia

Tagged: Computing, software, artificial intelligence, video, surveillance, image analysis, image processing
