Computers will really understand what you say when they know how you feel when you say it.
Sometimes it’s not what you say, but how you say it. That’s a truism most people can relate to, but computers can’t. While speech recognition software has gotten quite good at understanding words, it still can’t discern punctuation like periods and commas, or choose between ambiguous sentences whose meanings depend on the speaker’s emotion. That’s because such software still can’t make sense of the intonations, emphases and pauses (collectively known as prosody) that people intuitively use to make such distinctions.
But with more than a hundred corporate and academic research groups working on the problem, attempts at incorporating prosody into speech software are enjoying growing success. Prosody-based tools are already used for speech synthesis, to improve the naturalness of computer voices like those that recite your bank balance over the telephone. As prosody research advances, these automated systems will sound more and more natural even when speaking complex sentences.
It’s in speech recognition, however, that the most critical benefits should come. Recent advances suggest that within five years, prosody software will perform tasks such as telling when a customer speaking to an automated telephone system is getting angry and responding accordingly by, say, transferring the call to a human. “This is really cutting-edge stuff,” says Michael Picheny, computer scientist and manager of the speech and language algorithms group at IBM Research in Yorktown Heights, NY. “Until the last couple of years, the quality of speech recognition was so primitive that it wasn’t even worth exploring how to elicit different behaviors from machines by conveying emotional intent. First you just wanted the machine to get the words right.”
Sound waves have three basic components that prosody software can work with. The first is timing: the duration of pauses between words and sentences. Second comes pitch or, more precisely, the relative change in pitch within words and sentences. Lastly, there is volume: an amplitude change indicating emphasis.
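To make those three components concrete, here is a minimal Python sketch of how each might be measured from raw audio samples. It is an illustration only, not the method used by any of the research groups mentioned here: the function names, the 10-millisecond frame size, the silence threshold and the 75 to 400 Hz pitch search range are all assumptions chosen for the example.

```python
import math

def tone(freq_hz, seconds, sample_rate):
    """Hypothetical helper: a pure sine tone, used only to exercise the code."""
    n = round(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def frame_rms(samples, frame_len):
    """Volume: root-mean-square amplitude of each non-overlapping frame."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
        for s in starts
    ]

def pause_durations(rms, frame_seconds, silence_threshold):
    """Timing: length in seconds of each run of consecutive quiet frames."""
    pauses, run = [], 0
    for level in rms:
        if level < silence_threshold:
            run += 1
        elif run:
            pauses.append(run * frame_seconds)
            run = 0
    if run:  # the recording ended during a pause
        pauses.append(run * frame_seconds)
    return pauses

def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Pitch: fundamental frequency in Hz from the strongest autocorrelation lag.

    Returns 0.0 when no positive correlation is found (for example, silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0
```

Running `estimate_pitch` on successive frames gives a pitch track, and it is the rise or fall of that track, rather than any single value, that corresponds to the "relative change in pitch" described above. Real speech is far messier than a sine tone, so a practical system would need smoothing and voicing detection that this sketch leaves out.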
Gleaning meaning from these features is much tougher for a computer than identifying words, says Elizabeth Shriberg, a computer scientist who leads prosody research at SRI International in Menlo Park, CA. Words are a linear series of phonetic events, such as “ee” and “th.” Prosodic features, by contrast, occur across words and sentences. Worse, different kinds of prosodic patterns often overlap one another; one set might reveal that a sentence was spoken calmly, a second that the sentence was a question. But researchers are beginning to map them. For example, Shriberg and her coworkers have created a template of an angry sentence: it’s slower overall, has an exaggerated emphasis on key words and ends with a downward turn in pitch (“I needed this last Tuesday but it hasn’t arrived”). Shriberg has generated prosodic models of everything from different emotional states to punctuation to “disfluencies,” the shifts that occur when people change thoughts midsentence or mumble “uh.”
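The angry-sentence template can be pictured as three checks applied to one utterance. The Python sketch below is a toy rule-based version written for illustration; it is not SRI's model, and the `Utterance` fields, the function name and every threshold in it are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    """Hypothetical summary of one spoken sentence."""
    words_per_second: float
    word_loudness: List[float]  # one volume value per word
    pitch_track: List[float]    # pitch in Hz, one value per voiced frame

def matches_angry_template(utt, baseline_rate,
                           slow_ratio=0.85, emphasis_ratio=1.5, tail_frames=5):
    """Return True if the utterance shows all three cues named in the template.

    All thresholds are assumed values for the example, not measured ones.
    """
    # Cue 1: slower overall than the speaker's usual rate.
    slower = utt.words_per_second < slow_ratio * baseline_rate

    # Cue 2: at least one word is much louder than the average word.
    mean_loudness = sum(utt.word_loudness) / len(utt.word_loudness)
    emphasized = max(utt.word_loudness) > emphasis_ratio * mean_loudness

    # Cue 3: pitch ends lower than it was a few frames before the end.
    tail = utt.pitch_track[-tail_frames:]
    falling = len(tail) >= 2 and tail[-1] < tail[0]

    return slower and emphasized and falling
```

Hard yes-or-no rules like these would break down quickly in practice, because prosodic patterns overlap and vary from speaker to speaker. The sketch is meant only to show how the three cues combine.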
While SRI’s tools are still research exercises, wide-scale use of prosody for speech recognition could be just over the horizon. Limited applications are already used in Chinese speech recognition software. In Chinese, depending on inflection, the word “Ma” can mean “mom” or “horse,” or indicate that the sentence was a curse, explains Xuedong Huang, a computer scientist who heads a speech technology group at Microsoft Research in Redmond, WA. “That’s very dangerous,” he jokes. Microsoft has already begun incorporating simple prosody into its Chinese language speech recognition software and is working to create next-generation software for Chinese and Japanese languages.
The next five years should see English-language prosodic tools for speech recognition make their first market forays. One application: companies could automatically search recorded customer service telephone databases, find the angry calls and study what went wrong. Another possibility: identifying punctuation, so doctors speaking into dictation systems won’t have to say “period.”
But it will take at least ten years, says SRI’s Shriberg, before any computer can begin to do what people do every day: completely decode a conversation with all its inflection, while filtering out background noises. “We are trying to close the [man-machine] gap somewhat, so when humans are in short supply, or in space, or on life support, the computer will be as smart as possible,” Shriberg says. For now, anyway, just don’t try explaining that ambition to a computer. It won’t understand your excitement.