Christopher Mims

A View from Christopher Mims

Why Synthesized Speech Sounds So Awful

Most “synthesized” speech is actually a computationally intensive playback that can’t transcend a monotone.

August 24, 2010

We have tricorders, teleportation and dynamic touch-screen interfaces, but not the most mundane prediction of Star Trek and countless other sci-fi franchises: human-like synthesized speech.

Those of you who haven’t listened to synthesized speech since the last time you watched A Brief History of Time, prepare to be underwhelmed by the lack of progress. Here’s Roger Ebert using a text-to-speech synthesizer pre-programmed with his own voice:

And here, just for reference, is something no less intelligible and only a smidge more robotic, only it happens to be about 25 years old and running on a computer with about 1/62,000th the memory:

If this is the state of the art, is it any wonder that the Authors Guild no longer seems to care that the iPad, like the Kindle, can “read” a document aloud?

Granted, comparing Ebert’s Speech Generating Device (SGD) to Hawking’s reveals that we now have the ability to make a computer’s Robby-esque voice sound something like the person whose voice an SGD is meant to replace - a good first step in using these devices for sufferers from degenerative diseases like ALS or, in Ebert’s case, a loss due to cancer.

SGDs that sound like an individual are possible because of what’s known as data-based speech synthesis or concatenative speech synthesis. This technique is used in concert with “voice banking,” in which a user who knows they will lose the power of speech records hours of it in advance.

Synthesized versus concatenated speech

Unlike truly synthesized speech, a herculean task requiring a programmer to generate a voice from scratch using only modifications of basic sounds, data-based speech synthesis draws on a library of hours of natural speech, playing back short sections of it in order to compose any word in the target language. It’s a bit like the difference between old-school music synthesizers and sampling.
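As a rough illustration of the sampling analogy, the sketch below stitches an utterance together from pre-recorded snippets. The `library` dictionary, the unit names, and the short lists of numbers standing in for audio samples are all hypothetical; a real system would store hours of recorded waveform rather than a handful of values.

```python
# Toy concatenative playback: each unit name maps to a recorded snippet.
# Unit names and sample values are invented for illustration only.
library = {
    "he": [0.1, 0.3, 0.2],
    "llo": [0.4, 0.1],
    "wor": [0.0, -0.2, -0.1],
    "ld": [0.3],
}


def concatenate(units, library):
    """Join the stored snippets for `units` into one sample sequence."""
    samples = []
    for unit in units:
        if unit not in library:
            # Nothing is generated from scratch: no recording, no sound.
            raise KeyError(f"no recording for unit {unit!r}")
        samples.extend(library[unit])
    return samples


utterance = concatenate(["he", "llo", "wor", "ld"], library)
print(len(utterance))  # 9 samples, all copied from the library
```

Note that the function never creates a sound of its own. It can only play back what was recorded, which is the essential difference from synthesis from scratch.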

Monophones, Diphones, Triphones…

Data-based speech synthesis has a number of problems. The first is that it composes speech from diphones - pairs of adjacent speech sounds. This is fairly computationally intensive: every word the SGD speaks must be assembled from multiple diphones, each of which it must find in its existing database.
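To make the lookup concrete, here is a minimal sketch of how a word might be broken into diphones and checked against a recorded inventory. The phone spelling of "cat," the "_" silence marker, and the contents of `database` are assumptions made for this example, not a description of any particular SGD.

```python
def to_diphones(phones):
    """Return the adjacent-pair units for a phone sequence.

    The sequence is padded with "_" (silence) so that the first and
    last sounds of the word also get a transition unit.
    """
    padded = ["_"] + list(phones) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def missing_units(phones, database):
    """List the diphones a word needs that the database lacks."""
    return [d for d in to_diphones(phones) if d not in database]


# Hypothetical phone spelling of "cat" and a hypothetical inventory.
cat = ["k", "ae", "t"]
database = {"_-k", "k-ae", "ae-t"}

print(to_diphones(cat))              # ['_-k', 'k-ae', 'ae-t', 't-_']
print(missing_units(cat, database))  # ['t-_']
```

Even a three-sound word needs four units here, and a single gap in the inventory is enough to leave the word unpronounceable.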

This means thousands and thousands of diphones, and yet the words we speak are not merely concatenations of pairs of sounds; some words are collections of sounds unto themselves, and diphones common to two words might not sound right in a third, which could require a triphone or even something more. It’s easy to see how the number of possible combinations an SGD would have to choose from quickly becomes an intractable problem when moving beyond simple two-sound units.
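A back-of-the-envelope calculation shows how quickly the inventory grows. The sketch below assumes a language with 40 distinct sounds, a round figure chosen for illustration, and counts every possible ordered sequence; real languages forbid many combinations, so these are upper bounds.

```python
def unit_count(num_phones, unit_length):
    """Upper bound on distinct units made of `unit_length` sounds in a row."""
    return num_phones ** unit_length


NUM_PHONES = 40  # assumed inventory size, for illustration only

for name, length in [("monophones", 1), ("diphones", 2), ("triphones", 3)]:
    print(f"{name}: {unit_count(NUM_PHONES, length):,}")
# monophones: 40
# diphones: 1,600
# triphones: 64,000
```

Each extra sound per unit multiplies the count by the size of the inventory, and every one of those units has to be recorded, stored, and searched.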

The monotone problem

Even the best commercially available concatenated speech systems make no attempt to solve the problem of emphasis. In normal speech, we convey emotions through a range of tricks - pauses, the timing of syllables, tone. Even in the lab, the best attempts at putting emotions like anger and fear in synthesized speech successfully convey these feelings only about 60% of the time (pdf here), and the numbers are even worse for joy.

Like artificial intelligence, speech recognition, and computer vision, speech synthesis is another one of the functions humans perform easily that we have so far found incredibly difficult to reproduce in silico.

