Where Speech Recognition Is Going

Voice-controlled interfaces are showing up in mobile phones, TVs, and automobiles. One company believes it can give just about everything a voice.

Will Knight

May 29, 2012

Until recently, the idea of holding a conversation with a computer seemed pure science fiction. If you asked a computer to “open the pod bay doors”—well, that was only in movies.

But things are changing, and quickly. A growing number of people now talk to their mobile smart phones, asking them to send e-mail and text messages, search for directions, or find information on the Web.

“We’re at a transition point where voice and natural-language understanding are suddenly at the forefront,” says Vlad Sejnoha, chief technology officer of Nuance Communications, a company based in Burlington, Massachusetts, that dominates the market for speech recognition with its Dragon software and other products. “I think speech recognition is really going to upend the current [computer] interface.”

The shift has come about thanks in part to steady progress in the technologies needed to help machines understand human speech, including machine learning and statistical data-mining techniques. Sophisticated voice technology is already commonplace in call centers, where it lets users navigate through menus and helps identify irate customers who should be handed off to a real customer service rep.

Now the rapid rise of powerful mobile devices is making voice interfaces even more useful and pervasive.

Jim Glass, a senior research scientist at MIT who has been working on speech interfaces since the 1980s, says today’s smart phones pack as much processing power as the laboratory machines he worked with in the ’90s. Smart phones also have high-bandwidth data connections to the cloud, where servers can do the heavy lifting involved with both voice recognition and understanding spoken queries. “The combination of more data and more computing power means you can do things today that you just couldn’t do before,” says Glass. “You can use more sophisticated statistical models.”

The most prominent example of a mobile voice interface is, of course, Siri, the voice-activated personal assistant that comes built into the latest iPhone. But voice functionality is built into Android, the Windows Phone platform, and most other mobile systems, as well as many apps. While these interfaces still have considerable limitations (see “Social Intelligence”), we are inching closer to machine interfaces we can actually talk to.

Nuance is at the heart of the boom in voice technology. The company was founded in 1992 as Visioneer and has acquired dozens of other voice technology businesses. It now has more than 6,000 staff members at 35 locations around the world, and its revenues in the second quarter of 2012 were $390.3 million, a 22.4 percent increase over the same period in 2011.

In recent years, Nuance has deftly applied its expertise in voice recognition to the emerging market for speech interfaces. The company supplies voice recognition technology to many other companies and is widely believed to provide the speech component of Siri.

Speech is ideally suited to mobile computing, says Nuance’s CTO, partly because users have their hands and eyes otherwise occupied—but also because a single spoken command can accomplish tasks that would normally require a multitude of swipes and presses. “Suddenly you have this new building block, this new dimension that you can bring to the problem,” says Sejnoha. “And I think we’re going to be designing the basic modern device UI with that in mind.”

Inspired by the success of voice recognition software on mobile phones, Nuance hopes to put its speech interfaces in many more places, most notably the television and the automobile. Both are popular and ripe for innovation.

To find a show on TV, or to schedule a DVR recording, viewers currently have to navigate awkward menus using a remote that was never designed for keying in text queries. Products that were supposed to make finding a show easier, such as Google TV, have proved too complex for people who just want to relax for an evening’s entertainment.

At Nuance’s research labs, Sejnoha demonstrated software called Dragon TV running on a television in a mocked-up living room. When a colleague said, “Dragon TV, find movies starring Meryl Streep,” the interface instantly scanned through channel listings to select several appropriate movies. A version of this technology is already in some televisions sold by Samsung.

Apple is widely rumored to be developing its own television, and it’s speculated that Siri will be its controller. The idea has been fueled by Walter Isaacson’s biography of Steve Jobs, in which the late CEO is said to have claimed that he’d “finally solved” the TV interface.

Meanwhile, the Sync entertainment system in Ford automobiles already uses Nuance’s technology to let drivers pull up directions, weather information, and songs. About four million Ford cars on the road have Sync with voice recognition. Last week, Nuance introduced software called Dragon Drive that will let other car manufacturers add voice-control features to vehicles.

Both these new contexts are challenging. One reason voice interfaces have become popular on smart phones is that users speak directly into the device’s microphone. To ensure that its technology works well in televisions and cars, where there is more background noise, Nuance is experimenting with array microphones and noise-canceling technology.

Nuance makes a number of software development kits available to anyone who wants to include voice recognition technology in an application. Montrue Technologies, a company based in Ashland, Oregon, used Nuance’s mobile medical SDK to develop an iPad app that lets physicians dictate notes.

“It’s astonishingly accurate,” says Brian Phelps, CEO and cofounder of Montrue and himself an ER doctor. “Speech has turned a corner; it’s gotten to a point where we’re getting incredible accuracy right out of the box.”

In turn, the kits shore up Nuance’s position, helping the company improve its voice recognition and language processing algorithms by sending ever more voice data through its servers. As MIT’s Glass says, “there has been a long-time saying in the speech-recognition community: ‘There’s no data like more data’.” Nuance says it stores the data in an anonymous format to protect privacy.

Sejnoha believes that within a few years, mobile voice interfaces will be much more pervasive and powerful. “I should just be able to talk to it without touching it,” he says. “It will constantly be listening for trigger words, and will just do it—pop up a calendar, or ready a text message, or a browser that’s navigated to where you want to go.”

Perhaps people will even speak to computers they wear, like the photo-snapping eyeglasses in development at Google. Sources at Nuance say they are actively planning how speech technology would have to be architected to run on wearable computers.
