A Browser that Speaks Your Language

The latest version of Google’s Chrome shows the potential of HTML5.

David Zax

April 6, 2011

Early adopters can now get a sneak peek at the future of the Web by downloading the latest prerelease, or “beta,” version of Chrome, Google’s Web browser. One of the most interesting new features is an ability to translate speech to text—entirely via the Web.

The feature is the result of work Google has been doing with the World Wide Web Consortium’s HTML Speech Incubator Group, whose mission is “to determine the feasibility of integrating speech technology in HTML5,” the Web’s emerging standard language.

A Web page employing the new HTML5 feature could have an icon that, when clicked, initiates a recording through the computer’s microphone, via the browser. Speech is captured and sent to Google’s servers for transcription, and the resulting text is sent back to the website.
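For readers who think in code, that round trip can be sketched roughly as follows. This is a minimal illustration, not Chrome's or Google's actual interface: the function and class names are hypothetical, the recorder and the transcription service are simple stand-ins, and the two error strings are the ones the browser displayed during the test described in this article.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# All names here are hypothetical; none of them are Chrome or Google APIs.


@dataclass
class SpeechResult:
    text: Optional[str]   # transcription, if one came back
    error: Optional[str]  # message shown to the user otherwise


def handle_microphone_click(
    record: Callable[[], bytes],
    transcribe: Callable[[bytes], str],
) -> SpeechResult:
    """Run one click-to-text round trip.

    record     -- stands in for the browser capturing microphone audio
    transcribe -- stands in for the remote transcription service
    """
    audio = record()
    if not audio:
        # Nothing was captured, so there is nothing to send.
        return SpeechResult(text=None, error="speech not recognized")
    try:
        text = transcribe(audio)
    except ConnectionError:
        # The audio never reached the service, or no reply came back.
        return SpeechResult(text=None, error="connection to speech servers failed")
    if not text.strip():
        # The service replied but could not make out any words.
        return SpeechResult(text=None, error="speech not recognized")
    # The resulting text is handed back to the page.
    return SpeechResult(text=text, error=None)


def fake_record() -> bytes:
    """Stand-in recorder: returns a few placeholder bytes instead of real audio."""
    return b"\x00\x01fake-audio"


def fake_transcribe(audio: bytes) -> str:
    """Stand-in service: always 'hears' the same sentence."""
    return "the quick brown fox jumps over the lazy dog"


def offline_transcribe(audio: bytes) -> str:
    """Stand-in service that simulates a network failure."""
    raise ConnectionError("no route to speech servers")


if __name__ == "__main__":
    print(handle_microphone_click(fake_record, fake_transcribe))
    print(handle_microphone_click(fake_record, offline_transcribe))
```

Passing the recorder and the transcriber in as arguments keeps the sketch independent of any real microphone or network, which is why it can be run as written.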

To experiment with the voice-to-text feature, download the latest beta version of Chrome here. Then go to this webpage, click on the microphone, and start talking. You’ll probably find the results mixed, and sometimes hilarious. Using the finest elocution I could muster, I read the opening passage of Richard Yates’s Revolutionary Road: “The final dying sounds of their dress rehearsal left the Laurel Players with nothing to do but stand there, silent and helpless.” I got error messages several times in a row (“speech not recognized” or “connection to speech servers failed”). Once, I received this transcription: “9 sounds good restaurants on the world there’s nothing to do with fam vans island.”

The new feature derives in large part from experiments Google conducted through its Android operating system for mobile devices. For more than a year, says Vincent Vanhoucke, a member of Google’s voice recognition team, Android app developers have been able to integrate voice recognition into their apps using technology provided by Google. This has provided Google with useful voice data with which to train its voice-recognition algorithms. Today, some 20 percent of searches on Android phones are conducted using voice recognition, says Vanhoucke: people use voice recognition to write texts, send emails, or conduct searches. “It has really opened up interesting new avenues,” says Vanhoucke.

However, unlike desktop voice-to-text software, which first accustoms itself to a user’s voice, Chrome is trying to churn out text from voice without prior training.

“I suppose if they keep track of [the] IP address, they could adapt” to a given user’s voice, says Jim Glass, a speech recognition expert at MIT. Glass notes that the mobile phone provides an acoustic environment very different from that of a laptop or desktop computer; for one thing, a phone’s microphone is reliably placed right at the user’s mouth, unlike computer microphone setups in homes or offices. “This is the beta version of Chrome,” says Glass. “They’ll be collecting data, and we can be sure they will be refining their models–that’s the nature of the speech-recognition game.”

Even if it’s rough around the edges, sometimes the technology impresses. I tried once again and got back “the final warning sounds of the dress rehearsal at laurel players with nothing to do with stand there.” Not so bad. And the Chrome app nailed it to a letter when all I said was “the quick brown fox jumps over the lazy dog.”

Third-party programmers have also begun creating Web pages capable of using the new feature of Chrome. Already available for trial is a browser plugin called Speechify that lets you search Google, Hulu, YouTube, Amazon, and other sites using voice with Chrome.

Other inventive uses could soon follow. “Games could be taking keyboard, mouse, touch, accelerometer, and speech input together,” says Karl Westin, an expert on HTML5 who works for Nerd Communications, based in Berlin, Germany. “Having an aeroplane game where you could actually scream ‘up, UP, UUUPPP!’ could be fantastic.”

But the technology is more than just a toy; it also points the way to a much more capable Web. HTML4, the last major version of the HTML language, emerged in 1997. Since then, plugins like Silverlight and Flash have added media-processing capabilities to the Web. HTML5, by contrast, enables media playback and offline storage in the browser itself, without plugins.

“The insight we had was that more and more people were spending all their time in the browser,” says Google’s Brian Rakowski, group product manager for Chrome. E-mail and instant messaging increasingly take place in browsers rather than in separate e-mail or AIM applications. “We’d like it to be the case that you never have to install a native application again,” says Rakowski. “The Web should be able to do all of it.”
