The Search for a Clearer Voice

How Google’s Voice Search is getting so good.

Paul Boutin

January 10, 2011

Smart phones are great at a lot of things, with one exception: Typing on a touch screen or a downsized keyboard is still frustrating compared to a full-size computer keyboard. That’s probably why Google says that, even before the release of its new personalized Voice Search app for Android in mid-December, one in four mobile searches were already input by voice rather than from a keyboard.

The improved Voice Search takes speech recognition to its next level: Google’s servers will now log up to two years of your voice commands in order to more precisely parse exactly what you’re saying.

In tests on the new app, which appeared in Google’s Android Market a week before Christmas, the app originally got about three out of five searches correct. After a few days, the ratio crept up to four out of five. It’s surprisingly good at searches that involve common nouns (“heathen child lyrics”) and what search experts call vertical searches for popular topics like airline flights and movie listings. Voice Search knows “United Flight 714” and “True Grit show times 90066” when it hears them. Less successful are searches involving people’s names. In repeated attempts to Google up WikiLeaks founder Julian Assange, Voice Search got no closer than “wikileaks founder julian of songs.”

How does it work? Rather than try to use the phone itself to do speech recognition, Voice Search digitizes the user’s spoken commands and sends them off to Google’s gargantuan server farms. There, the spoken words are broken down and compared both to statistical models of what words other people mean when they utter those syllables and to a history of the user’s own voice commands, through which Google refines its matching algorithm for that particular voice. The app recognizes five different flavors of English (American, British, Australian, Indian and South African), plus Afrikaans, Cantonese, Czech, Dutch, French, German, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Russian, Spanish, Turkish, and Zulu.

The tricky part—and the motive for a personalized search app—is that human voices vary wildly between men and women, between young people and old people, and among those with various accents and dialects. By storing hundreds, perhaps thousands of what speech recognition experts call “utterances” by the same person over months of use, Voice Search can better guess at what that particular person is saying.
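To make the idea of personalization concrete, here is a minimal sketch in Python of how a recognizer might blend a general score with one user's history. It is an illustration only, not Google's method: the function name rescore, the weight parameter, the scores, and the use of past phrases as a stand-in for a per-voice model are all assumptions made for this example.

```python
from collections import Counter


def rescore(candidates, user_history, weight=0.3):
    """Pick the best transcription from (text, general_score) pairs.

    general_score is a made-up stand-in for a population-wide model's score.
    user_history is a Counter of phrases this user has spoken before
    (a hypothetical simplification of a per-user model).
    """
    total = sum(user_history.values())

    def combined(item):
        text, general_score = item
        # Share of this user's past utterances that match the candidate.
        personal = user_history[text] / total if total else 0.0
        return (1 - weight) * general_score + weight * personal

    return max(candidates, key=combined)[0]


# Illustrative numbers only: the general model slightly prefers the wrong guess.
candidates = [("julian of songs", 0.55), ("julian assange", 0.45)]
history = Counter({"julian assange": 3, "true grit show times": 1})

best_without_history = rescore(candidates, Counter())
best_with_history = rescore(candidates, history)
print(best_without_history)
print(best_with_history)
```

With an empty history the general score decides alone. Once the user has a record of saying a phrase, that record can tip a close call toward it.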

The mathematical model used to recognize phrases was refined over three years using voice samples from Google’s now-defunct GOOG-411 automated directory assistance service, which the company operated from 2007 through late last year specifically to capture a wide-ranging set of voice samples for analysis. The company’s first Voice Search app, for iPhone only, was launched a year after GOOG-411, in November 2008.

Voice Search doubles as a spoken-command system for the phone. As shown in this video, it understands commands such as, “Send mail to Mike LeBeau. How’s life in New York treating you? The weather’s beautiful here.” The app will find LeBeau in your contacts—it’s better at matching names here than in a Web search, because it’s working with a limited set—and will fill in the subject line with your first sentence. You can speak additional text into the message, or edit it with the phone’s keyboard, before sending it.
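The point about a limited set can be sketched with Python's standard difflib module: when the only possible answers are the names in a contact list, even a rough transcription usually lands on the right one. This is a hedged illustration, not how the app works internally. The helper match_contact, the cutoff value, the misheard phrase, and the contact names other than Mike LeBeau are hypothetical.

```python
import difflib


def match_contact(heard, contacts, cutoff=0.6):
    """Return the contact whose name is closest to the heard text, or None."""
    # Compare in lowercase, but return the name as stored in the contact list.
    lowered = {name.lower(): name for name in contacts}
    hits = difflib.get_close_matches(
        heard.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


# Hypothetical contact list and a deliberately garbled transcription.
contacts = ["Mike LeBeau", "Maria Lopez", "Sam Brown"]
best = match_contact("mike le bow", contacts)
print(best)
```

In an open Web search the same garbled phrase competes with every possible query, which is one way to see why names fare worse there.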

Google has clearly put a lot of effort into its speech recognition technology. But the impact on its bottom line is obvious: By removing the aggravation of typing on tiny keys, the company hopes to get customers to reach for its search and e-mail services much more often.
