Video Searching

Better methods for finding a face in the crowd.

David Voss

July 1, 2001

Every day, a river of video floods the airwaves, courses through cables and streams over the Internet. Add to that all the films ever made, plus all of the video material created for private use, and you’ve got an ocean of light and sound. But how can you ever find and retrieve a particular video clip?

With text documents, you can type in a query, and a piece of software finds the matching text strings. Searching video is much tougher. Unless someone has gone back and marked up the video data, it’s now nearly impossible to find a specific image. A content provider like CNN has more than a hundred thousand hours of tape in its video archive, far too much for any human to view and annotate manually. Now a small but growing number of labs are searching for novel ways to navigate the video glut.

These are still early days for video indexing and retrieval. A few existing Web search engines like AltaVista can find some video clips, but they return only those that sit on Web pages with text that can be searched by keywords. Likewise, San Mateo, CA-based Virage has developed a search engine for ABCNews.com that allows the transcript of a broadcast to be searched; the search, however, is also by keywords, and the video is played from the point at which the specified word occurred. None of these systems provides direct image searches: in other words, a video answer to the command “give me all the clips of an astronaut outside the space station Mir.”
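To make the transcript approach concrete, here is a minimal sketch of keyword search over a time-stamped transcript. It is an illustration only, not Virage's implementation: the transcript format (a list of time-and-word pairs) and the function name find_playback_points are assumptions made for this example.

```python
import re
from typing import List, Tuple

# Hypothetical transcript format: (seconds from start of broadcast, spoken word).
Transcript = List[Tuple[float, str]]


def find_playback_points(transcript: Transcript, keyword: str) -> List[float]:
    """Return the times at which the keyword is spoken.

    Matching ignores case and surrounding punctuation, so "Mir." matches "mir".
    """
    needle = keyword.casefold()
    return [
        time
        for time, word in transcript
        if re.sub(r"\W+", "", word).casefold() == needle
    ]


transcript = [
    (0.0, "Cosmonauts"), (0.6, "aboard"), (1.1, "Mir,"),
    (5.2, "the"), (5.4, "station"), (9.8, "Mir."),
]
# A player would start the video at each returned time.
print(find_playback_points(transcript, "mir"))
```

The search never looks at the pictures: a clip that shows Mir without anyone saying its name would not be found, which is the limitation described above.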

Video-search and database tools that directly find images can be far more powerful than keyword searches. At Columbia University, a team led by Shih-Fu Chang is developing software that can search a video for particular features in the images, such as shape, color and motion. For example, you could select a static image from a catalogue and have the software find close matches in the video frames. Or you could make a simple sketch of a blob, with a few arrows to show how it moves, and the system finds video segments that match these features. For instance, you could roughly sketch the shape of the Mir space station and a human figure moving outside it.
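One simple way to picture query-by-example is to compare color distributions. The sketch below ranks frames by how closely their coarse color histograms overlap with a query image. It is a simplified stand-in, not the Columbia team's software: it covers color only (no shape or motion), and the names color_signature, similarity and rank_frames are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]          # (red, green, blue), each 0-255
Signature = Dict[Tuple[int, int, int], float]


def color_signature(pixels: Sequence[Pixel], levels: int = 4) -> Signature:
    """Quantize each channel into `levels` bands and return color frequencies."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


def similarity(a: Signature, b: Signature) -> float:
    """Histogram intersection: 1.0 for identical color mixes, 0.0 for disjoint."""
    return sum(min(weight, b.get(color, 0.0)) for color, weight in a.items())


def rank_frames(query: Sequence[Pixel], frames: List[Sequence[Pixel]],
                top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (frame index, score) pairs for the best color matches."""
    q = color_signature(query)
    scored = [(similarity(q, color_signature(f)), i) for i, f in enumerate(frames)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [(i, score) for score, i in scored[:top_k]]


query = [(250, 10, 10)] * 4                              # a mostly red image
frames = [
    [(10, 10, 250)] * 4,                                 # all blue
    [(240, 20, 20)] * 4,                                 # all red
    [(240, 20, 20)] * 2 + [(10, 10, 250)] * 2,           # half red, half blue
]
print(rank_frames(query, frames, top_k=2))
```

Color alone is a crude signal; a practical system would combine it with shape and motion features of the kind described above.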

This kind of direct image query could be especially useful for large databases of video records. Chang’s group has been researching ways to extract information from medical-exam videos. Every year at Columbia-Presbyterian Medical Center, “ten thousand echocardiograms [ultrasound movies of the heart] are performed,” he explains. “Each is about a half-hour long, and they get put into a tape library.” A cardiologist then has to look up the ultrasound to make a diagnosis, wasting a lot of time fast-forwarding and rewinding through the tape. Much better would be an automatic way of detecting signs of heart ailment in the video stream. Chang’s software first parses the ultrasound video into segments by looking for sharp changes in image content, such as when the view on the ultrasound display is switched to another angle. Each segment is then processed by a “view recognizer” that matches the images to known images of abnormal events and flags any suspected heart conditions.
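The segmentation step can be illustrated with a common textbook technique: compare the brightness histograms of consecutive frames and start a new segment wherever they differ sharply. This is a minimal sketch under stated assumptions, not Chang's algorithm; the frame format (a flat, non-empty list of grayscale values), the threshold of 0.5 and the function names are all chosen for the example.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[int]  # flat list of grayscale pixel values, 0-255 (assumed format)


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Normalized brightness histogram of one non-empty frame."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def split_into_segments(frames: Sequence[Frame],
                        threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index pairs, with end exclusive."""
    if not frames:
        return []
    segments = []
    start = 0
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        if histogram_distance(previous, current) > threshold:
            segments.append((start, i))   # sharp change: close the segment
            start = i
        previous = current
    segments.append((start, len(frames)))
    return segments


# Three dark frames followed by two bright ones, as if the view had switched.
video = [[10] * 16] * 3 + [[200] * 16] * 2
print(split_into_segments(video))
```

Each resulting segment could then be handed to a separate recognition step; that step is not sketched here because the article does not describe how the “view recognizer” works internally.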

At Carnegie Mellon University, researchers are creating a digital library that combines natural-language processing, speech recognition and image analysis. “The integration of these different technologies is the key,” says Howard D. Wactlar, director of the Informedia Digital Video Library Project at Carnegie Mellon. A prototype captures news broadcasts from around the world and stores them, along with summaries or storyboards. Someone can then type in a question, or just utter the question aloud: “Tell me about oxygen problems on the Russian space station Mir.” All the relevant news clips are displayed as frame icons you can click on. The system is also incorporating face recognition to make it possible to call up all the clips of a particular person.

It will be some time before direct video searches become routine. But if today’s research pays off, finding a video needle in the immense multimedia haystack will be no more difficult than typing in a few words, or maybe sketching out a simple image.
