More-Accurate Video Search

Speech-recognition software could improve video search.

Kate Greene

June 12, 2007

Boston-based startup EveryZing has launched a search engine that it hopes will change the way that people search for audio and video online. Formerly known as PodZinger, a podcast search engine, EveryZing is leveraging speech systems developed by technology company BBN that can convert spoken words into searchable text with about 80 percent accuracy. This bests other commercially available systems, says EveryZing CEO Tom Wilde.

**Audio cues:** A new video and audio search engine can convert audio into a text transcript with 80 percent accuracy. That’s good enough to show snippets of the transcript, direct users to the spot in the file where a search term appears, and summarize key concepts.

This high accuracy is enabling new search capabilities, Wilde says, such as the ability to provide entire transcripts of video and audio, and the ability to direct people to the exact spot in a file where a word or phrase is spoken. The technology will also let the company provide targeted ads associated with specific content, much in the way that Google provides ads based on the text of a Web page.

“The big challenge [in online video and audio] … is the opaqueness of media content,” says Wilde. It’s extremely difficult to know what range of content is inside a video or audio clip. “The problem we want to solve,” he says, “is the discoverability of multimedia within Web search.” EveryZing does this by extracting the content of multimedia files and outputting text so that it can take advantage of the preexisting text-search tools developed by the likes of Google and Yahoo.

The Web is exploding with multimedia from YouTube, podcasts, TV news reports, and National Public Radio shows. But it’s still difficult to search for “Barack Obama” and pull up all the instances on the Web in which his name is mentioned. Typically, the titles of clips and the tags that people assign to them don’t contain enough information to give useful search results. This is why a handful of companies have, over the past couple of years, been exploring the use of audio content as a guide. For instance, video search engine Blinkx uses speech-recognition technology to scour the entire Web for relevant content, aggregating it on a single site, much as Google aggregates Web pages. (See “Surfing TV on the Internet.”)

EveryZing’s business goals differ from Blinkx’s, says Wilde, and he suspects that the two approaches can complement each other. “We’re about merchandising content, not trolling the Web,” he says. EveryZing (which, like Blinkx, provides a search portal for Web surfers) mainly wants to partner with content providers to make their multimedia searchable. For instance, the company wants to convert all the audio and video content within ABC.com into searchable text, adding time stamps to that text (as well as preexisting closed-captioned text) so a person can immediately jump to a specific word in a clip.
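The jump-to-the-word idea can be illustrated with a minimal sketch. This is not EveryZing's or BBN's implementation; the `TimedWord` type, the function names, and the sample transcript are all hypothetical. It assumes only that a recognizer emits each word with the time at which it is spoken, and it builds a small inverted index so that a phrase lookup returns the moments in the clip to seek to.

```python
from collections import defaultdict
from typing import NamedTuple


class TimedWord(NamedTuple):
    """One recognized word and the time (in seconds) at which it is spoken."""
    word: str
    start_seconds: float


def build_index(transcript):
    """Map each lowercased word to the positions where it occurs."""
    index = defaultdict(list)
    for position, timed in enumerate(transcript):
        index[timed.word.lower()].append(position)
    return index


def find_phrase(transcript, index, phrase):
    """Return the start time of every exact occurrence of the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    # Only positions where the first term occurs can start a match.
    for position in index.get(terms[0], []):
        window = transcript[position:position + len(terms)]
        if [timed.word.lower() for timed in window] == terms:
            hits.append(transcript[position].start_seconds)
    return hits


# Hypothetical recognizer output for a short news clip.
transcript = [
    TimedWord("Senator", 12.0),
    TimedWord("Barack", 12.6),
    TimedWord("Obama", 13.1),
    TimedWord("spoke", 13.7),
    TimedWord("at", 14.0),
    TimedWord("a", 14.1),
    TimedWord("rally", 14.3),
]
index = build_index(transcript)
```

Calling `find_phrase(transcript, index, "Barack Obama")` returns the time at which the phrase begins, which a player could use as a seek position. Because the index is plain text keyed by word, it is the kind of data that ordinary text-search tools can also consume.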

In addition, unlike Blinkx’s current technology, BBN’s technology lets EveryZing extract high-level concepts that originally might not have been searched for. If someone searched for “Barack Obama,” for instance, EveryZing might also offer other keywords in the clip, such as “rally.”

The idea of using audio transcripts to search for multimedia has been around in research labs for decades, and basic speech-recognition research dates back even earlier. Much of the seminal work occurred at BBN, MIT, Carnegie Mellon University, IBM, and SRI International. In 1995, Carnegie Mellon had a working demonstration of a similar video search system, says Richard Stern, professor of electrical and computer engineering at the university. This system, called Informedia, spurred other research in the field, he says, and was the precursor to BBN’s modern video analysis approach.

EveryZing’s underlying technology has two basic components, both from Boston-based BBN. The core speech-to-text system, called Byblos, has been funded with $50 million in research money from a series of government grants over the past five years, says Wilde. Using probabilistic machine learning algorithms, the system takes one minute to convert each minute of audio content into text.

The second part of the technology, says Wilde, consists of algorithms that process the content of the text. BBN’s natural language technology contains huge stores of phrases and words for context, which helps it make sense of a video. For instance, a news segment about health might use language that’s specific to the medical field. In this case, the system would be able to recognize certain obscure words. Understanding the meaning of the text is a powerful tool, says Wilde, because it lets EveryZing provide high-level concepts to users so that they can fine-tune their search. And importantly, it enables the company to pair targeted ads with the right content.

The time is right for a video search engine with these capabilities, says Carnegie Mellon’s Stern. “Video is a much more compelling and entertaining medium than just plain text,” he says, and now so much of it is available on the Internet. He adds that BBN’s 80 percent accuracy is “really quite a feat,” and it should be adequate for searching the troves of content online.

While the technology is good, it’s not perfect, says EveryZing’s Wilde. Accuracy drops when background music is present or when several people talk at once. But for the infotainment and news market that the company is targeting right now, the technology should offer a significant improvement over what’s currently available, he says. “I think we’ll look back in a couple of years and say, ‘Of course the content of multimedia files needs to be searchable,’” says Wilde. “It’d be as if the Web pages could only be searched by title and tag.”
