AT&T Wants to Put Your Voice in Charge of Apps

Cloud-based speech and translation technology could allow any app to be voice-controlled.

David Talbot

April 23, 2012

In an effort to make speech the dominant way that people control technology, AT&T is opening up its speech-recognition technology for others to use. Starting in June, software engineers can tap into a cloud service offered by the company to make any device that can connect to the Internet respond to its master’s voice.

**Qué?:** A lab demonstration shows a real-time conversation between an English and a Spanish speaker. Spoken words were rendered into text, translated, and spoken by machine to the listener on either end.

AT&T believes the technology could ultimately be used for everything from smart-phone apps and online games to cars and appliances. The initial offering will only convert speech into text and corresponding commands, but the company is considering a broader set of offerings later, including ones that translate text between English and six other languages and synthesize the translated speech.

“We believe there are a lot of smart people out there who can create applications and services we have never dreamt of before,” says Mazin Gilbert, vice president for intelligent systems research at AT&T Labs in Florham Park, New Jersey. To use the technology, developers write code into their software to take advantage of an API (application programming interface) specified by AT&T. That code causes an application to send speech to AT&T over the Internet, where it is converted to text and returned to the device. The new APIs were announced last week. AT&T claims the technology is 95 percent accurate in taking English speech and rendering it as text. It says its accuracy at converting the meaning of English text to and from other languages ranges from 70 percent to 80 percent.
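The round trip described above, in which an app sends recorded speech over the Internet and gets text back, might look roughly like the following Python sketch. It is an illustration only and not AT&T's actual interface: the endpoint URL, the `X-Speech-Context` header, the bearer-token scheme, and the `{"text": ...}` reply format are all assumptions made for the example.

```python
import json
import urllib.request

# Hypothetical endpoint: a placeholder, not a real AT&T URL.
SPEECH_URL = "https://speech.example.com/v1/speech-to-text"


def build_request(audio: bytes, token: str, context: str = "generic") -> urllib.request.Request:
    """Package recorded audio as an HTTP POST for a cloud recognizer."""
    return urllib.request.Request(
        SPEECH_URL,
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "audio/wav",
            "Accept": "application/json",
            "X-Speech-Context": context,  # hypothetical header naming the use case
        },
    )


def extract_text(response_body: bytes) -> str:
    """Pull the transcript out of an assumed JSON reply of the form {"text": "..."}."""
    return json.loads(response_body.decode("utf-8"))["text"]


def recognize(audio: bytes, token: str, context: str = "generic") -> str:
    """Send the audio to the service and return the recognized text."""
    request = build_request(audio, token, context)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_text(response.read())
```

In this sketch the device does no recognition itself. It only uploads audio and reads back a string, which is why any Internet-connected device could in principle use such a service.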

The underlying speech technology now being offered by AT&T is already used in many of its own applications, including the AT&T translator app for Android and iOS phones, and mobile voice directory search provided by Yellow Pages. “I want to be able to have a million apps riding on our platform, not hundreds, as we have today,” Gilbert says. “Whatever your wild idea is—we want to provide those APIs. I’ll be honest: I don’t know what people are going to use it for.”

The AT&T technology builds on decades of innovation at Bell Labs prior to the breakup of AT&T and the subsequent establishment of AT&T’s own service-centric labs. However, the company must compete with more established providers of speech-recognition technology, especially in the realm of smart phones.

For example, Nuance provides speech-recognition capabilities to many companies including, reportedly, Apple for its Siri personal assistant. Google’s speech-recognition technology is offered throughout its Android smart-phone operating system and can be used by any app written for an Android device. Microsoft also has speech-recognition technology, which appears in its Windows Phone operating system and in products from partners such as Ford, with its Sync system for in-car entertainment.

Krish Prabhu, CEO of AT&T Labs, believes that making speech technology widely available will allow mobile computing to be more capable and grow faster. “In the context of a world where we’ve largely solved connectivity and reach problems—though there are still issues—this effort on speech comes from a conviction that the interface to the network has to get simpler,” he said at a lab demonstration in New York City last week. “We are trying to pave the way so that technology is not the thing that stops us.”

AT&T’s APIs for speech-to-text, to launch in June, consist of seven versions tailored to specific uses, such as dictating text messages, searching for local businesses, responding to questions, turning voice-mails into text, and performing general dictation. In the future, specific APIs for online games and social networks will also be added.

Later, APIs may become available that translate text between English and six other languages: Spanish, French, Italian, German, Chinese, and Japanese. Other languages, including Korean and Arabic, are in the pipeline, but AT&T will be far behind competitors. For example, Google already offers developers tools that can translate between any of over a thousand language pairs.

Gilbert says the use of all the APIs would carry a $99 registration fee for 2012, and that post-2012 plans were not public. Google charges for its own translation APIs.

Improving the accuracy of speech-recognition or translation software requires getting more example data to train the underlying algorithms. To help that process, AT&T could eventually solicit feedback from people using products that have its speech and translation technology built in. “Crowdsourcing would enable this to reach much higher levels of accuracy, and this would, in turn, drive broader adoption and much happier users,” says Sam Ramji, a computer scientist who is vice president for strategy at Apigee, which builds API platforms and is working on the AT&T project.

Ramji believes that making good speech-recognition technology easily available could slowly make traditional menus and text-driven interfaces extinct. “Today’s user interfaces are like trees that we have to navigate to reflect the structure of the program. What should happen is that devices parse the command coming out of our mouths,” he says.
