The Search for Voice Activation

Google’s voice interface patent gives life to rumors that voice-activated mobile search will soon be a reality.

Kate Greene

April 21, 2006

Google is mum on how, if at all, it plans to use its recently granted patent for a voice-enabled search engine, even though it has also hired several speech-recognition researchers.

Originally filed in February 2001, the patent, titled “Voice interface for a search engine,” likely signals some level of development at Google on voice technology for searching the Web from handhelds. Further fueling speculation is Google’s poaching of several speech-recognition specialists – the kind of move that often means a new product is afoot.

“They’ve put together a very strong group of people who are experts in speech-recognition technology,” says Nelson Morgan, director of the International Computer Science Institute, which is affiliated with the University of California at Berkeley. He explains that Google picked up a number of engineers from Nuance, “the big gun in speech-recognition technology,” which recently lost researchers after it was purchased by ScanSoft, a speech-technology company (the combined company is called Nuance).

Investing in voice-search technology is a prescient move by Google, since the mobile Web appears to be at a tipping point. Globally, 28 percent of all mobile-phone users now surf the Web (up three percent over last year), according to Ipsos Insight, a Paris-based market research firm. In the United States, three out of every four homes now have mobile phones, and increasingly those phones are being used for more than just talking. In 2005, over half (52 percent) of all mobile-phone users had either sent or received a text message and 37 percent had sent or received an e-mail message.

Despite this growth, though, searching the Internet on handhelds can be maddening because of tiny keyboards and unfriendly designs, which require, at best, a stylus or, at worst, a series of clicks through numerous menu screens. Speech recognition software may provide a better way. And a handful of companies are exploring the technology. San Diego’s V-Enable offers voice search services to help users find ring-tone listings, and Promptu of Menlo Park, CA plans to offer similar services this summer.

At their most fundamental, speech-recognition systems perform three major functions. First, spoken words are captured and translated into a digital signal. Then a speech-recognition algorithm compares that signal to words and phrases from a pre-set dictionary of, say, ring-tone options or movie listings. Finally, the software offers the most likely match for the spoken phrase.
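The three stages can be illustrated with a toy sketch. It is not how a production recognizer works: real systems score acoustic features, whereas this example stands in a plain text string for the digitized signal and uses string similarity from Python's standard `difflib` module as the comparison step. The dictionary contents and the function names (`capture`, `score`, `recognize`) are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical small vocabulary, standing in for a pre-set dictionary
# of movie listings.
MOVIE_LISTINGS = ["king kong", "ice age", "inside man", "v for vendetta"]


def capture(utterance: str) -> str:
    """Stage 1 stand-in: turn the input into a normalized 'signal'.

    Here the signal is just lowercased text with collapsed whitespace.
    """
    return " ".join(utterance.lower().split())


def score(signal: str, entry: str) -> float:
    """Stage 2 stand-in: similarity in [0, 1] between signal and entry."""
    return SequenceMatcher(None, signal, entry).ratio()


def recognize(utterance: str, dictionary: List[str]) -> Tuple[str, float]:
    """Stage 3: return the most likely dictionary entry and its score."""
    signal = capture(utterance)
    scored = [(entry, score(signal, entry)) for entry in dictionary]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # A slightly garbled input still lands on the closest listing.
    print(recognize("ice aje", MOVIE_LISTINGS))
```

Because every input is forced onto one of a few entries, a small dictionary leaves little room for error, which is why this approach suits ring tones and movie listings better than open-ended queries.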

Voice-activated systems, while getting better, can still sometimes get the answers wrong, explains Morgan. The software matches the voice input to a number of possibilities and often asks the user if the system’s highest-ranked word was the one he or she intended, or it might say the word was unintelligible. These systems work best when the algorithm needs to access only a small dictionary, he adds, such as ring-tone options, movie listings, or phone-book contacts.
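The confirm-or-reject behavior Morgan describes can be sketched as a simple threshold rule. This is an assumption made for illustration, not a description of any particular product: the `respond` function, the 0.5 cutoff, and the prompt wording are all hypothetical.

```python
from typing import List, Tuple


def respond(candidates: List[Tuple[str, float]],
            reject_below: float = 0.5) -> str:
    """Pick the highest-scoring candidate and decide what to say.

    candidates: (phrase, score) pairs with scores in [0, 1].
    reject_below: hypothetical cutoff under which the input is
    treated as unintelligible.
    """
    if not candidates:
        return "Sorry, I didn't catch that."
    phrase, best = max(candidates, key=lambda pair: pair[1])
    if best < reject_below:
        return "Sorry, I didn't catch that."
    # Ask the user to confirm the system's highest-ranked guess.
    return f"Did you mean '{phrase}'?"


if __name__ == "__main__":
    print(respond([("ice age", 0.86), ("inside man", 0.40)]))
    print(respond([("king kong", 0.20)]))
```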

That methodology, though, would be impractical for Web searches, since it’s impossible to limit the search dictionary while providing access to the more than eight billion web pages that Google searches.

Google’s voice search patent approaches the problem by taking a step back from simply plugging standard voice-recognition technology into standard search technology, says Morgan. Instead of trying to accurately predict the single-best guess about what a person is saying, the technology would take a handful of word and phrase possibilities and throw them at the powerful Google search engine. In this way, the voice search system may not need the most accurate speech recognition technology. Instead, it relies on Google’s strength – its search algorithm – to supply the most likely result for a number of possibilities.
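A rough sketch of that idea follows. The article does not spell out the patent's actual method, so everything here is an assumption: the toy corpus, the word-overlap `search` function, and the rule of weighting each result by the recognizer's confidence in the phrase that produced it are hypothetical stand-ins for a real recognizer and a real search engine.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy corpus standing in for a web index.
DOCUMENTS = {
    "doc1": "how to recognize speech with a computer",
    "doc2": "speech recognition software for mobile phones",
    "doc3": "best beach vacations this summer",
}


def search(query: str) -> Dict[str, float]:
    """Toy search: fraction of query words found in each document."""
    words = query.lower().split()
    results = {}
    for doc_id, text in DOCUMENTS.items():
        doc_words = set(text.split())
        hits = sum(1 for word in words if word in doc_words)
        if hits:
            results[doc_id] = hits / len(words)
    return results


def voice_search(hypotheses: List[Tuple[str, float]],
                 top_k: int = 3) -> List[Tuple[str, float]]:
    """Search with every candidate phrase and merge the results.

    hypotheses: (phrase, recognizer confidence) pairs. Each document's
    final score is the sum of confidence * relevance over all phrases.
    """
    combined = defaultdict(float)
    for phrase, confidence in hypotheses:
        for doc_id, relevance in search(phrase).items():
            combined[doc_id] += confidence * relevance
    # Highest score first; ties broken by document id for stable output.
    ranked = sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    guesses = [
        ("recognize speech", 0.5),
        ("wreck a nice beach", 0.4),
        ("recognise peach", 0.1),
    ]
    for doc_id, doc_score in voice_search(guesses):
        print(doc_id, round(doc_score, 2))
```

In this sketch the recognizer never has to commit to one phrase. A mistaken guess that matches little in the corpus contributes little to the final ranking, so the search step absorbs some of the recognition uncertainty.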

This strategy of outsourcing the translation and searching to remote Google servers has its benefits, says Jordan Cohen, senior scientist at SRI International in Palo Alto, CA. It reduces the complexity of the software on the phone, which would therefore use less processing power, memory, and energy. Instead, it relies on the strength of the network – and some faith that the technology would be able to deal with the uncertainty of both vocal input and an intended search. If speech technology is instead confined to the mobile device, Cohen says, the software can “count on the person to fix things up” when the algorithm can’t find the most suitable word.

Because mobile voice search has barely begun to take form, it remains an open question whether Google’s approach will be employed – or produce the best product. But the market for mobile search in general is “going to be immense,” says Cohen. “The Google patent is an attempt to stake out a claim in that space.”
