Phones Pick Up Language

Faster chips and better software help mobile devices recognize speech.

Mara E. Vatz

October 1, 2004

Cell phones and wireless PDAs have one perennial problem: either no keyboard or a very small one. That makes typing anything more than a phone number a tedious, fumbling task. But a solution is on the way: mobile devices that are adept at recognizing spoken language.

Some cell phones already use speech recognition as an alternative to keypad entry for simple tasks such as dialing a number, but someday soon you may also find yourself dictating a text message into your phone, asking your car for directions, or telling your MP3 player that you want to listen to the Beatles. Indeed, today’s high-end cell phones are capable of running sophisticated speech recognition software that could eventually mean the end of pecking at keyboards. “The fundamental problem of inputting information into mobile devices is the interface, and voice overcomes that,” says Rich Geruson, CEO of VoiceSignal, a speech technology company based in Woburn, MA.

While companies like IBM and Dragon Systems (now part of Peabody, MA-based ScanSoft) have been selling desktop speech recognition software for more than a decade, mobile devices with even limited speech recognition abilities appeared only a few years ago. And until now, such devices have largely been “speaker dependent” – meaning they work well only for their principal users and have to be trained to recognize individual words.

Faster processors and more efficient software, however, are enabling new speaker-independent systems that can recognize the speech of any user and require no training. These systems can discern thousands, rather than dozens, of names and are designed to work even when the speaker is in a noisy environment, such as the front seat of a speeding car.

For the engineers at VoiceSignal, the key to this advance was a shift in focus from accuracy to efficiency. The highly accurate speech recognition algorithms designed for desktop computers are too complex to run on mobile devices. Traditional algorithms for mobile devices required less processing power, but because they worked by matching the sound wave of an entire word to a sound wave stored in a device’s memory, they were limited to a small vocabulary.

Instead of storing an entire sound wave for each word in its lexicon, VoiceSignal’s new system stores information about phonemes – the smallest units of recognizable speech. Every phoneme can be described according to a set of acoustic parameters, such as pitch. The software measures a user’s utterances along these parameters and then looks for words that match. Parameter values take up less memory than audio files, so the software can handle a larger vocabulary without requiring any additional storage space.
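
To make the idea concrete, here is a toy sketch in Python of a lexicon that stores words as phoneme sequences rather than as recorded audio. It is not VoiceSignal's algorithm, whose details the company has not described here. The phoneme table, the two-number parameter vectors, and the function names are all invented for illustration, and the sketch assumes the utterance has already been cut into exactly one measurement per phoneme. Real recognizers have to align continuous audio of varying length against their models, which is considerably harder.

```python
import math

# Hypothetical phoneme inventory: each phoneme is described by a short
# vector of acoustic parameters. The numbers are made up for illustration.
PHONEMES = {
    "k": (0.1, 0.9),
    "ae": (0.8, 0.3),
    "t": (0.2, 0.7),
    "b": (0.3, 0.1),
    "d": (0.4, 0.6),
    "aa": (0.9, 0.5),
    "g": (0.5, 0.2),
}

# The lexicon stores each word as a list of phoneme labels, so adding a
# word costs a few labels rather than a whole recorded sound wave.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "dog": ["d", "aa", "g"],
}


def score(frames, phonemes):
    """Total distance between measured frames and a word's phonemes.

    Assumes one measured frame per phoneme; words of a different length
    are ruled out by returning infinity.
    """
    if len(frames) != len(phonemes):
        return math.inf
    return sum(math.dist(f, PHONEMES[p]) for f, p in zip(frames, phonemes))


def recognize(frames):
    """Return the lexicon word whose phonemes lie closest to the frames."""
    return min(LEXICON, key=lambda word: score(frames, LEXICON[word]))


# Slightly noisy measurements of the three phonemes in "cat".
utterance = [(0.12, 0.88), (0.79, 0.31), (0.22, 0.69)]
print(recognize(utterance))
```

Because the phoneme table is shared by every word, growing the vocabulary from dozens of entries to thousands adds only short label sequences to the lexicon, which is the memory saving the paragraph above describes.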

And that’s opening up applications beyond simple voice dialing. For example, VoiceSignal offers software that lets users jump to any node of a cell phone’s menu with a single utterance. “If you try to send a [text] message on your phone right now, you have to do about ten clicks just to get to the message space,” says Geruson. “With our technology, you just say, ‘Send message to John Smith’s mobile,’ and your cursor is flashing and ready to go.” The first phone with this capability was released in August. Within six months the company also plans to release software for phones that lets users dictate text messages and e-mails, which Geruson anticipates will be particularly useful in Asia. “If you think it’s hard to input into a keyboard in Western or Latin languages, think about the problem in Japan or China, where you have thousands of characters,” he says.

Researchers at ScanSoft, meanwhile, are putting speech recognition to use in cars. A car kit with a built-in microphone, speakerphone, and ScanSoft speech engine provides motorists with a hands-free interface for their cell phones. A phone equipped with a Bluetooth wireless transmitter can be placed anywhere in a car, and drivers can use voice commands to dial, accept or reject calls, adjust the volume, and control menu options, all without taking their hands off the wheel or their eyes off the road.

While the wireless industry has been the first to embrace speech recognition, makers of consumer electronics appear to be close behind. At the Mitsubishi Electric Research Laboratories in Cambridge, MA, researchers are developing software that enlists speech to simplify the task of searching for information. Rather than scrolling through 10,000 MP3 songs on a handheld device, for instance, a user could select a single song just by saying its name – or that of a band or album. “We decided one of the things [speech] was good at was choosing,” says Mitsubishi speech technology researcher Peter Wolf.

Despite these advances, however, it remains to be seen how widely speech recognition will be adopted. Phone users may feel uncomfortable dictating personal e-mails in public. And they may always want keyboards for entering sensitive information such as credit card numbers. But Geruson predicts that the technology will eventually transform the way people use mobile devices. As a few early adopters take to the technology, he says, “it will catch on, and then it will be everywhere.”
