Hearing What We Want to Hear

An inside look at research at MIT.

Karen Chenausky

April 1, 1997

Say pet. Now say pat. Hear the difference? Of course. Now say one halfway between the two. Can’t do it? Or can you just not hear that you’re doing it?

Since the 1960s, speech scientists have known that although it’s possible to produce sounds that are acoustically halfway between two recognized vowels, speakers never hear them as halfway; they perceive them as one or the other. Even when people listen to a series of synthetic vowels progressing in equal acoustic steps between the vowel in, say, pat and the one in pet, they hear a series of pats followed by a series of pets. Now researchers at MIT and elsewhere are building theories to explain this phenomenon, called categorical perception, so they can better understand how we hear speech. Such research could have implications for learning second languages, and might even help computers understand us better.

It’s well known that our ability to discriminate between different versions of the same vowel isn’t uniform for all gradations of that vowel. Some are harder to distinguish from their neighbors than others. Researchers think that variations in our sensitivity to small acoustic differences between speech sounds may help us categorize and interpret the sounds of our native languages.

Louis Braida, a professor in the Sensory Communication Group at MIT’s Research Laboratory of Electronics (RLE), set out to map native speakers’ sensitivity to English vowels. When he asked subjects whether they could distinguish between slightly different vowels, he found that discrimination ability is highest not between the best example of each vowel and its immediate neighbors, but between examples near the category boundaries, where one vowel is on the verge of being perceived as a different vowel. In earlier research, Braida and his colleague Nathaniel Durlach had discovered a similar pattern in people’s sensitivity to variations in loudness. Subjects found it harder to discriminate pairs of faint tones than pairs of loud tones, as expected. But they got a boost in sensitivity when comparing subtly different examples at the extremes of a set of tones, around the loudest or the softest examples. They were least sensitive to variations square in the middle of the range.

Braida theorizes that people categorize sounds by estimating how far each is, acoustically, from what he calls “perceptual anchors”: memorable stimuli located at the edges of a range of examples. To discern where a particular vowel sound falls in relation to the extremes, he says, “you gauge the difference from each anchor with a ‘perceptual ruler’ that measures in units of a just-noticeable difference.” The farther the vowel is from an anchor, however, the blurrier the ruler becomes and the less accurately the sound is perceived. For Braida, the imperfection of the ruler corresponds to the limitations of our basic auditory resolution ability.
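For readers who like to see an idea in numbers, the anchor-and-ruler picture can be sketched in a few lines of Python. This is only an illustration, not Braida's actual model: the linear growth of the ruler's blur with distance, the inverse-variance rule for combining the two anchor readings, and every name and parameter value here (`ruler_noise`, `base`, `growth`, a range of 20 just-noticeable differences) are assumptions made for the example.

```python
import math


def ruler_noise(distance_jnd, base=1.0, growth=0.2):
    """Blur of the hypothetical 'perceptual ruler', in JND units.

    Assumed form: the blur grows linearly with distance from the anchor.
    """
    return base + growth * distance_jnd


def effective_noise(position, low_anchor, high_anchor):
    """Combine the readings taken from both anchors.

    Assumption: the two readings are merged by inverse-variance weighting,
    so the nearer (sharper) anchor dominates the estimate.
    """
    blur_low = ruler_noise(abs(position - low_anchor))
    blur_high = ruler_noise(abs(high_anchor - position))
    return 1.0 / math.sqrt(1.0 / blur_low**2 + 1.0 / blur_high**2)


def sensitivity(position, low_anchor=0.0, high_anchor=20.0, step=1.0):
    """Index of how easily two stimuli `step` JNDs apart can be told apart."""
    return step / effective_noise(position, low_anchor, high_anchor)


if __name__ == "__main__":
    # Print the sensitivity index at the edges and the middle of the range.
    for position in (0.0, 5.0, 10.0, 15.0, 20.0):
        print(f"position {position:4.1f}  sensitivity {sensitivity(position):.3f}")
```

Under these assumptions the index is largest at the two anchors and smallest midway between them, the same shape as the loudness results: sharpest at the extremes of the range, dullest in the middle.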

Categorical perception may also be influenced by our native languages, according to Patricia Kuhl, a professor in the Department of Speech and Hearing Sciences at the University of Washington. Using almost 100 synthesized versions of a particular vowel (the long-e sound, as in Pete), Kuhl asked subjects to rate each sample on a scale from one to seven. A certain region of the vowel space, a “sweet spot” if you will, consistently got the best ratings. Kuhl calls this region the prototype. Like Braida, she discovered that listeners’ sensitivity to differences is lowest in this middle region and highest at the edges of the range. She attributes this variation to a “perceptual magnet effect” in the middle region. “The prototype appears to act like a magnet for other sounds in the category,” says Kuhl. “It seems to perceptually ‘assimilate’ nearby sounds, making it difficult for people to hear any differences between the prototype and these other sounds.”
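The magnet idea can be pictured the same way, as a warping of perceptual space. The sketch below is a toy, not Kuhl's analysis: it assumes a one-dimensional vowel axis and a Gaussian-shaped pull toward the prototype, and the names and values (`perceived_position`, `pull`, `width`) are hypothetical.

```python
import math


def perceived_position(acoustic, prototype=0.0, pull=0.6, width=2.0):
    """Map an acoustic position to a perceived one.

    Assumption: sounds are drawn toward the prototype, and the pull fades
    with distance following a Gaussian of the given width.
    """
    offset = acoustic - prototype
    return acoustic - pull * offset * math.exp(-offset**2 / (2 * width**2))


def perceived_gap(a, b, **kwargs):
    """Perceived distance between two sounds at acoustic positions a and b."""
    return abs(perceived_position(b, **kwargs) - perceived_position(a, **kwargs))


if __name__ == "__main__":
    # Two pairs with the same acoustic spacing (0.5 units).
    print("near the prototype:", round(perceived_gap(0.0, 0.5), 3))
    print("far from it:       ", round(perceived_gap(5.0, 5.5), 3))
```

With these made-up settings, a pair of sounds right next to the prototype ends up perceptually closer together than an equally spaced pair far from it, which is the sense in which the prototype "assimilates" its neighbors.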

Nature Versus Nurture

Kuhl believes that the special status of prototype vowels becomes impressed on our minds early in life. Research performed on American and Swedish babies by Janet Werker, a professor of psychology at the University of British Columbia, suggests that, by the age of 10 or 12 months, infants lose the ability to hear distinctions that do not occur in their native languages. On the basis of such research, Kuhl concludes that categorical perception has a learned component: people with different native languages have different prototypes for their vowels, and thus different boundaries and different regions of sensitivity.

To what extent categorical perception is learned, as opposed to innate, has yet to be determined, but knowing more about how the phenomenon arises could conceivably result in better speech recognition systems. An advantage that categorical perception seems to bestow on humans is that it helps us reduce the effect of variability in speech. “Listeners,” Kuhl says, “must be able to categorize, or ‘render equivalent,’ the sounds produced by different people, even though the sounds are very different acoustically.” But variability is a problem for speech recognition systems: strong accents, for example, can render certain words unintelligible to machines. Kenneth Stevens, who heads the Speech Communication Group at MIT’s RLE and whose research interests include speech recognition, believes computers could be programmed to pay less attention to details that humans don’t notice. “The design of speech-recognition systems,” he says, “should take into account that the human auditory-brain system is inherently sensitive to certain attributes of sounds in speech and not sensitive to others.”

Understanding the delicately layered structure in our ability to perceive speech could also yield insights into the teaching and learning of foreign languages. The same categorical perception that numbs us to certain phonetic differences in our native tongues can reduce our sensitivity to crucial distinctions in other languages. As Kuhl has written, “The phonetic categories of the native language are analogous to a perceptual sieve…. The phonetic units of the newly acquired language must pass through the sieve, making distinctions in the new language imperceptible.” This may be why speakers of Japanese, for example, have trouble hearing the l/r distinction in English: the prototype to which they have become attuned falls somewhere between the two consonants. Explicit training in hearing such distinctions may help second-language learners pick up a new accent more efficiently. And further research might give us a better idea of how well adult immigrants can be expected to learn new languages.
