Saying It with Feeling
At the Boston Computer Museum, next to a machine that guesses one’s height, sits an inconspicuous display consisting of a simple computer terminal. The exhibit, called “Synthetic Emotional Speech,” is intended more for the ears than for the eyes. The computer recites lines from the play Waiting for Godot and the Abbott and Costello routine “Who’s on First?” using any of six emotions the user specifies: annoyed, cordial, disdainful, distraught, impatient, or plaintive.
The idea of putting some feeling into computer-generated speech is part of an effort by Janet Cahn, a graduate student in the MIT Media Lab, to make such speech “come alive,” she says. Her principal motivation is to develop software that can help speech-impaired people communicate more effectively. “Most nonspeaking people are frustrated by the technical options currently available to them,” she says. “They don’t want to talk like a machine,” especially since emotions help convey a sense of the speaker’s mental state. Cahn adds that more authentic-sounding synthetic speech might lead to better computerized reading devices for blind people, emergency telephone systems that could provide information to callers in a calm, soothing voice, and even playback units that screenwriters might use to test dramatic dialogue.
The display at the Computer Museum is based on a program Cahn wrote called “Affect Editor” that alters the speech emitted by DECtalk, a standard synthesizer. When a user selects an emotion for the reading, the software assigns one of 21 integers (from -10 to 10) to each of numerous acoustical qualities representing aspects of pitch, voice quality, timing, articulation, and loudness. The program specifies that angry speech, for instance, is loud, high-pitched, and quick, with irregular rhythms and inflections and precise enunciation. Sad speech is soft, low-pitched, and slurred, displaying minimal variability and many pauses.
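The mapping from an emotion to a set of integer settings can be pictured with a short sketch. This is not Cahn's code: the parameter names and the magnitudes below are invented for illustration, and only the -10 to 10 scale and the direction of each setting (loud versus soft, high versus low pitch, and so on) come from the description above.

```python
from typing import Dict

# The article gives a scale of 21 integers, from -10 to 10.
SCALE_MIN, SCALE_MAX = -10, 10

# Hypothetical parameter names; the real program's names are not given.
PARAMETERS = [
    "loudness", "pitch", "speech_rate", "rhythm_irregularity",
    "articulation_precision", "pitch_variability", "pause_frequency",
]

# Invented magnitudes. Only the signs follow the article: angry speech is
# loud, high-pitched, quick, irregular, and precisely enunciated; sad speech
# is soft, low-pitched, slurred, with little variability and many pauses.
PRESETS: Dict[str, Dict[str, int]] = {
    "angry": {"loudness": 8, "pitch": 7, "speech_rate": 6,
              "rhythm_irregularity": 7, "articulation_precision": 6},
    "sad": {"loudness": -6, "pitch": -7, "articulation_precision": -7,
            "pitch_variability": -8, "pause_frequency": 7},
}


def clamp(value: int) -> int:
    """Keep a setting inside the 21-step scale."""
    return max(SCALE_MIN, min(SCALE_MAX, value))


def settings_for(emotion: str) -> Dict[str, int]:
    """Return one integer per parameter; unspecified ones stay neutral (0)."""
    preset = PRESETS[emotion]
    return {name: clamp(preset.get(name, 0)) for name in PARAMETERS}


if __name__ == "__main__":
    for emotion in PRESETS:
        print(emotion, settings_for(emotion))
```

In this sketch a setting of 0 stands for neutral speech, so each emotion is expressed as a set of offsets from a neutral voice. That is a design assumption made here, not something the article states.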
To create this software, Cahn drew on previous research that had determined acoustical qualities characteristic of various emotions. In the late 1960s, for example, investigators isolated features of fearful speech by conducting analyses of pilots’ voices just before their planes crashed. A 1972 study explored acoustical aspects of anguish by examining a recording of a radio announcer reporting the crash of the Hindenburg.
After incorporating such variables, Cahn fine-tuned the model by testing it on a small group of people. The 28 subjects correctly identified the emotions conveyed by Affect Editor 53 percent of the time, a promising finding considering that the subjects correctly guessed the emotional content of human speech with only slightly greater accuracy.
The doctoral student has recently turned her attention to a related problem: developing for synthesized speech a sense of what she calls speaking style. “A classical music host, rock DJ, and sports announcer on the radio could use almost identical words yet sound very different,” she explains. A speaker’s style can range from formal to informal, varying according to the audience and subject matter.
Relying on research findings similar to those she used to create Affect Editor, Cahn is creating software that can alter style by modifying rhythm, stress, and other variables that also contribute to the emotional aspects of speech. The style program is “definitely not ready for primetime,” she says. “But both emotional content and style are essential if we’re going to reproduce the range and variability of human speech.”
Cahn is one of a handful of researchers working in the area of natural-sounding synthetic speech. Across the Atlantic, Iain Murray, a computer scientist at the University of Dundee in Scotland, has spent the past 10 years developing HAMLET, software that produces emotional speech that experimental subjects have identified correctly about 50 percent of the time. Murray has demonstrated the potential range of a standard speech synthesizer by making it sing: he and his colleagues have used HAMLET and DECtalk to create the vocals for several pop records released in Britain. One album, a 1989 recording by the Love Child Orchestra, even cracked the Top 100.
Cahn anticipates several more years of research before practical, commercial applications can be developed. Once the linguistic research is complete, the results will need to be translated into fast and flexible computer systems. Testing on larger groups will be necessary, as will meetings with groups of potential users.
The task is demanding, she says, because the choices people make regarding speech are so complex: “We have so many different ways of expressing ourselves.”