The video-processing system also uses algorithms that can describe the movement of objects in successive frames. It generates sentences like “boat1 follows boat2 between 35:56 and 37:23” or “boat3 approaches maritime marker at 40:01.” “Sometimes it can do a match on an object that has left and reentered a scene,” says Zhu, “and say, for example, this is probably a certain car again.” It is also possible to define virtual “trip wires” to help it describe certain events, like a car running a stop sign (see video).
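As a rough illustration of the trip-wire idea, the Python sketch below checks whether a tracked object's path crosses a line segment drawn in the scene and, if so, emits a sentence in the style quoted above. It is not I2T's actual method, which the article does not detail. The function names, the track format (a list of timestamped positions), and the "crosses" wording are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Track = List[Tuple[int, Point]]  # hypothetical format: (time in seconds, (x, y))


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: >0 if c is left of the line a->b, <0 if right."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segment p1-p2 properly crosses segment q1-q2.

    Touching at an endpoint or overlapping collinearly is not counted.
    """
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def format_time(seconds: int) -> str:
    """Render a time offset as minutes:seconds, e.g. 2401 -> '40:01'."""
    return f"{seconds // 60}:{seconds % 60:02d}"


def describe_crossing(label: str, track: Track,
                      wire: Tuple[Point, Point], wire_name: str) -> Optional[str]:
    """Return a sentence for the first crossing of the wire, or None."""
    for (_, p0), (t1, p1) in zip(track, track[1:]):
        if segments_cross(p0, p1, wire[0], wire[1]):
            return f"{label} crosses {wire_name} at {format_time(t1)}"
    return None


# Assumed example data: a car moving left to right past a vertical stop line.
car_track = [(2400, (0.0, 0.0)), (2401, (2.0, 0.0))]
stop_line = ((1.0, -1.0), (1.0, 1.0))
sentence = describe_crossing("car1", car_track, stop_line, "stop line")
```

A real system would also need the object's speed to tell a car that stopped at the line from one that ran the sign; this sketch reports only the crossing itself.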
Although the system demonstrates a step toward what Zhu calls a “grand vision in computer science,” I2T is not yet ready for commercialization. Processing surveillance footage is relatively easy for the software because the camera, and hence the background in a scene, is static. But I2T is far from capable of recognizing the variety of objects or situations a human could. If set loose on random images or videos found online, for example, I2T would struggle to perform as well.
Improving the system’s knowledge of how to identify objects and scenes by adding to the number of images in the Lotus Hill Institute training set should help, says Zhu.
The I2T system underlying the surveillance prototype is powerful, says Zu Kim, a researcher at the University of California, Berkeley, who studies the use of computer vision to aid traffic surveillance and vehicle tracking. “It’s a really nice piece of work,” he says, even if it can’t come close to matching human performance.
Kim explains that better image parsing is relevant to artificial intelligence work of all kinds. “There are very many possibilities for a good image parser: for example, allowing a blind person to understand an image on the Web.”
Kim can see other uses for generating text from video, pointing out that it could be fed into a speech synthesizer. “It could be helpful if someone was driving and needed to know what a surveillance camera was seeing.” But humans are visual creatures, he adds, and in many situations would probably prefer to judge for themselves what is happening in an image or a video.