EveryZing’s underlying technology combines two technologies from Boston-based BBN. The core speech-to-text system, called Byblos, has been backed by $50 million in research funding from a series of government grants over the past five years, says Wilde. Using probabilistic machine-learning algorithms, the system takes one minute to convert each minute of audio content into text.
The second part of the technology, says Wilde, consists of algorithms that process the content of the text. BBN’s natural language technology contains huge stores of phrases and words for context, which help it make sense of a video. For instance, a news segment about health might use language that’s specific to the medical field. In this case, the system would be able to recognize certain obscure words. Understanding the meaning of the text is a powerful tool, says Wilde, because it lets EveryZing provide high-level concepts to users so that they can fine-tune their search. Just as important, it enables the company to pair targeted ads with the right content.
The time is right for a video search engine with these capabilities, says Carnegie Mellon’s Stern. “Video is a much more compelling and entertaining medium than just plain text,” he says, and now so much of it is available on the Internet. He adds that BBN’s 80 percent accuracy is “really quite a feat,” and it should be adequate for searching the troves of content online.
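The article does not say how the 80 percent figure is measured. A common convention in speech recognition is word accuracy, defined as one minus the word error rate, where errors are counted as the minimum number of word substitutions, insertions, and deletions needed to turn the system's transcript into a reference transcript. The sketch below assumes that convention; the function names and sample sentences are illustrative and do not come from BBN or EveryZing.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance computed over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - word error rate (an assumed definition)."""
    errors, total = word_errors(reference, hypothesis)
    if total == 0:
        raise ValueError("reference transcript is empty")
    return 1 - errors / total


# Made-up example: 2 of 10 reference words are misrecognized.
reference = "the doctor prescribed a new drug for the heart patient"
hypothesis = "the doctor described a new rug for the heart patient"
accuracy = word_accuracy(reference, hypothesis)
```

Under this definition, a transcript that gets two of every ten words wrong can still contain most of the terms a searcher is likely to type, which is consistent with Stern's point that such accuracy should be adequate for search.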
While the technology is good, it’s not perfect, says EveryZing’s Wilde. The accuracy drops when background music is present or when several people are talking at once. But for the infotainment and news market that the company is targeting right now, the technology should offer a significant improvement over what’s currently available, he says. “I think we’ll look back in a couple of years and say, ‘Of course the content of multimedia files needs to be searchable,’” says Wilde. “It’d be as if the Web pages could only be searched by title and tag.”
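To illustrate what searching the content of a multimedia file, rather than its title and tags, can look like, here is a minimal sketch of an inverted index built over timestamped transcript segments. It is not EveryZing's implementation: the data layout (clip ID, start time in seconds, transcribed text) and the names `build_index` and `search` are assumptions made for the example.

```python
from collections import defaultdict

PUNCTUATION = ".,!?\"'"


def build_index(segments):
    """Map each transcribed word to the (clip_id, start_seconds) spots where it is spoken."""
    index = defaultdict(set)
    for clip_id, start, text in segments:
        for raw in text.lower().split():
            word = raw.strip(PUNCTUATION)
            if word:
                index[word].add((clip_id, start))
    return index


def search(index, query):
    """Return the segments whose transcript contains every query word."""
    terms = [t.strip(PUNCTUATION) for t in query.lower().split()]
    terms = [t for t in terms if t]
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


# Hypothetical transcript segments, as a speech-to-text step might produce them.
segments = [
    ("clip1", 0.0, "New study on heart disease."),
    ("clip1", 30.0, "Sports scores tonight"),
    ("clip2", 12.5, "Heart surgery advances"),
]
index = build_index(segments)
results = search(index, "heart disease")
```

Because each hit carries a start time, a player could jump straight to the moment a term is spoken, something a title-and-tag search cannot offer.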