Video Searching by Sight and Script

Researchers have designed an automated system to identify characters in television shows, paving the way for better video search.

Brendan Borrell

October 11, 2006

Google’s acquisition this week of YouTube.com has raised hopes that searching for video is going to improve. More than 65,000 videos are uploaded to YouTube each day, according to the website. With all that content, finding the right clip can be difficult.

Now researchers have developed a system that uses a combination of face recognition, closed-captioning information, and original television scripts to automatically name the faces that appear on screen, making episodes of the TV show Buffy the Vampire Slayer searchable.

“We basically see this work as one of the first steps in getting automated descriptions of what’s happening in a video,” says Mark Everingham, a computer scientist now at the University of Leeds (formerly of the University of Oxford), who presented his research at the British Machine Vision Conference in September.

Currently, video searches offered by AOL Video, Google, and YouTube do not search the content of a video itself, but instead rely primarily on “metadata,” or text descriptions, written by users to develop a searchable index of Web-based media content.

Users frequently (and illegally) upload bits and pieces of their favorite sitcoms to video-sharing sites such as YouTube. For instance, a recent search for “Buffy the Vampire Slayer” turned up nearly 2,000 clips on YouTube, many of them viewed thousands of times. Most of these clips are less than five minutes long, and the descriptions are vague. One titled “A new day has come,” for instance, is described by a user this way: “It mostly contains Buffy and Spike. It shows how Spike was there for Buffy until he died and she felt alone afterward.”

Everingham says previous work in video search has used data from subtitles to find videos, but he’s not aware of anyone using his method, which combines, in a technical tour de force, subtitles and script annotation. The script tells you “what is said and who said it” and subtitles tell you “what time something is said,” he explains. Everingham’s software combines those two sources of information with powerful tools previously developed to track faces and identify speakers without the need for user input.
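To make the idea concrete, here is a minimal sketch of how a script (who says what) might be lined up with subtitles (what is said when). It is an illustration only, not the researchers’ code: it matches words with Python’s standard difflib module, and the function names and the sample dialogue are invented for the example.

```python
from difflib import SequenceMatcher

PUNCT = ".,!?\"'"


def words(text):
    """Lowercase a line and strip surrounding punctuation from each word."""
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]


def align(script, subtitles):
    """Attach a speaker name to each timed subtitle.

    script:    list of (speaker, line) pairs, in order (hypothetical format)
    subtitles: list of (start_sec, end_sec, text) triples, in order
    Returns a list of (start_sec, end_sec, speaker_or_None, text).
    """
    # Flatten both sources into word sequences, remembering where each word came from.
    script_words = [(w, speaker) for speaker, line in script for w in words(line)]
    sub_words = [(w, i) for i, (_, _, text) in enumerate(subtitles) for w in words(text)]

    matcher = SequenceMatcher(
        None,
        [w for w, _ in script_words],
        [w for w, _ in sub_words],
        autojunk=False,
    )

    # Each matched word votes for the speaker of the subtitle it falls in.
    votes = [dict() for _ in subtitles]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            speaker = script_words[block.a + k][1]
            index = sub_words[block.b + k][1]
            votes[index][speaker] = votes[index].get(speaker, 0) + 1

    result = []
    for (start, end, text), tally in zip(subtitles, votes):
        speaker = max(tally, key=tally.get) if tally else None
        result.append((start, end, speaker, text))
    return result


# Invented sample data, for illustration only.
script = [("BUFFY", "Where have you been?"), ("SPIKE", "Out. Walking around.")]
subtitles = [
    (10.0, 11.5, "Where have you been?"),
    (12.0, 13.0, "Out."),
    (13.2, 14.5, "Walking around."),
]
labelled = align(script, subtitles)
```

The result is a list of timed lines, each tagged with the character the script says spoke it. A subtitle that shares no words with the script is left with no speaker rather than guessed.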

What made the Buffy project such a challenge, Everingham says, is that in film and television, the person speaking is not always in the shot. The star, Buffy, may be speaking off-screen or facing away from the camera, for instance, and the camera will be showing you the listener’s reactions. Other times, there may be multiple actors on the screen or the actor’s face is not directly facing the camera. All of these ambiguities are easy for humans to interpret, but difficult for computers–at least until now. Everingham says their multimodal system is accurate up to 80 percent of the time.

A single episode of Buffy can have up to 20,000 instances of detected faces, but most of these instances arise from multiple frames of a single character in any given shot. The software tracks key “landmarks” on actors’ faces–nostrils, pupils, and eyes, for instance–and if one of them overlaps with the next frame, the two faces are considered part of a single track. If these landmarks are unclear, though, the software uses a description of clothing to unite two “broken” face tracks. Finally, the software also watches actors’ lips to identify who’s speaking or if the speaker is off screen. Ultimately, the system produces a detailed, play-by-play annotation of the video.
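The grouping of per-frame detections into tracks can be sketched in a few lines. This is a simplified stand-in, not the method described above: it links faces by how much their bounding boxes overlap from one frame to the next, rather than by following facial landmarks, and the 0.5 overlap threshold, the box format, and the function names are assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


def link_tracks(frames, min_iou=0.5):
    """Group per-frame face boxes into tracks.

    frames: list of frames, each a list of (x1, y1, x2, y2) boxes.
    Returns a list of tracks; each track is a list of (frame_index, box).
    """
    tracks = []
    open_tracks = []  # tracks that were extended in the previous frame
    for t, boxes in enumerate(frames):
        still_open = []
        candidates = list(open_tracks)
        for box in boxes:
            # Find the open track whose latest box overlaps this one the most.
            best = max(candidates, key=lambda tr: iou(tr[-1][1], box), default=None)
            if best is not None and iou(best[-1][1], box) >= min_iou:
                candidates = [tr for tr in candidates if tr is not best]
                best.append((t, box))
                still_open.append(best)
            else:
                # No good match: this face starts a new track.
                new_track = [(t, box)]
                tracks.append(new_track)
                still_open.append(new_track)
        open_tracks = still_open
    return tracks


# Invented sample: two faces persist for two frames, then a new face appears.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (50, 51, 60, 61)],
    [(100, 100, 110, 110)],
]
tracks = link_tracks(frames)
```

A track ends as soon as a face goes undetected for a frame. That gap is the kind of “broken” track a second cue, such as clothing, would be needed to rejoin.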

“The general idea is that you want to get more information without having people capture it,” says Alex Berg at the Computer Vision Group at University of California, Berkeley. “If you want to find a particular scene with a character, you have to first find the scenes that contain that character.” He says that Everingham’s research will pave the way for more complex searches of television programming.

Computer scientist Josef Sivic at Oxford’s Visual Geometry Group, who contributed to the Buffy project, says that in the future it will be possible to search for high-level concepts like “Buffy and Spike walking toward the camera hand-in-hand” or all outdoor scenes that contain Buffy.

Timothy Tuttle, vice president of AOL Video, says, “It seems like over the next five to ten years, more and more people will choose what to watch on their own schedule and they will view content on demand.” He also notes that the barrier to adopting technologies like Everingham’s may no longer be technical, but legal.

These legal barriers have been coming down with print media because companies have reaped the financial benefits of searchable content–Google’s Book Scan and Amazon’s search programs have been shown to boost book sales over the last two years.

But it’s unclear whether a searchable video can increase DVD sales in the same way. Currently, Google offers teasers of premium video content, says staff scientist Michele Covell. For certain genres, like sports videos, it’s becoming easier to select a teaser clip that will encourage people to buy the video, she says.

Shumeet Baluja, a staff research scientist at Google, agrees that annotating video over the Web will be a challenge, but over time they’ll be able to add more and more metadata to popular video clips offline, which will improve the speed and accuracy of searches.
