Google’s acquisition this week of YouTube.com has raised hopes that searching for video is going to improve. More than 65,000 videos are uploaded to YouTube each day, according to the website. With all that content, finding the right clip can be difficult.
Now researchers have developed a system that uses a combination of face recognition, closed-captioning information, and original television scripts to automatically name the faces that appear on screen, making episodes of the TV show Buffy the Vampire Slayer searchable.
“We basically see this work as one of the first steps in getting automated descriptions of what’s happening in a video,” says Mark Everingham, a computer scientist now at the University of Leeds (formerly of the University of Oxford), who presented his research at the British Machine Vision Conference in September.
Currently, video searches offered by AOL Video, Google, and YouTube do not search the content of a video itself, but instead rely primarily on “metadata,” or text descriptions, written by users to develop a searchable index of Web-based media content.
Users frequently (and illegally) upload bits and pieces of their favorite sitcoms to video-sharing sites such as YouTube. For instance, a recent search for “Buffy the Vampire Slayer” turned up nearly 2,000 clips on YouTube, many of them viewed thousands of times. Most of these clips run less than five minutes, and their descriptions are vague. One titled “A new day has come,” for instance, is described by a user this way: “It mostly contains Buffy and Spike. It shows how Spike was there for Buffy until he died and she felt alone afterward.”
Everingham says previous work in video search has used data from subtitles to find videos, but he’s not aware of anyone else using his method, whose technical tour de force is combining subtitles with script annotation. The script tells you “what is said and who said it,” and the subtitles tell you “what time something is said,” he explains. Everingham’s software combines those two sources of information with powerful tools previously developed to track faces and identify speakers without the need for user input.
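To make the idea concrete, here is a minimal sketch in Python of how a script (who said what) and subtitles (when something was said) might be merged into time-stamped, speaker-labeled lines. It is an illustration only, not Everingham’s actual algorithm: the function names `normalize` and `align`, the 0.6 similarity threshold, the fuzzy text matching, and the sample dialogue are all assumptions made up for this example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and replace punctuation with spaces so that
    small formatting differences between script and subtitles are ignored."""
    cleaned = "".join(ch.lower() if ch.isalnum() or ch.isspace() else " "
                      for ch in text)
    return " ".join(cleaned.split())


def align(script, subtitles, min_ratio=0.6):
    """Attach a speaker name to each subtitle.

    script:    list of (speaker, line) pairs     -> who said what
    subtitles: list of (start, end, text) tuples -> when it was said
    min_ratio: hypothetical similarity cutoff; below it the subtitle
               is left unlabeled (speaker is None).
    """
    labeled = []
    for start, end, text in subtitles:
        best_speaker, best_score = None, 0.0
        for speaker, line in script:
            score = SequenceMatcher(None, normalize(text),
                                    normalize(line)).ratio()
            if score > best_score:
                best_speaker, best_score = speaker, score
        if best_score < min_ratio:
            best_speaker = None  # no script line is a convincing match
        labeled.append((start, end, best_speaker, text))
    return labeled


# Invented sample data, for illustration only.
script = [
    ("BUFFY", "Where were you last night?"),
    ("SPIKE", "Out. Walking around."),
]
subtitles = [
    (12.0, 13.5, "Where were you last night"),
    (14.0, 15.2, "Out, walking around."),
    (16.0, 17.0, "[door slams]"),
]

for start, end, speaker, text in align(script, subtitles):
    print(f"{start:6.1f}-{end:6.1f}  {speaker or '?':6}  {text}")
```

The output of a step like this only says who is speaking and when. Deciding which face on screen belongs to that speaker is a separate problem, which is where the face-tracking and speaker-identification tools come in.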
What made the Buffy project such a challenge, Everingham says, is that in film and television, the person speaking is not always in the shot. The star, Buffy, may be speaking off-screen or facing away from the camera, for instance, and the camera will be showing you the listener’s reactions. Other times, there may be multiple actors on the screen, or an actor’s face may not be turned directly toward the camera. All of these ambiguities are easy for humans to interpret but difficult for computers, at least until now. Everingham says the multimodal system is accurate up to 80 percent of the time.