Christopher Mims

A View from Christopher Mims

Why Synthesized Speech Sounds So Awful

Most “synthesized” speech is actually computationally intensive playback that can’t transcend a monotone.

  • August 24, 2010

We have tricorders, teleportation and dynamic touch-screen interfaces, but not the most mundane prediction of Star Trek and countless other sci-fi franchises: human-like synthesized speech.

Those of you who haven’t listened to synthesized speech since the last time you watched A Brief History of Time, prepare to be underwhelmed by the lack of progress. Here’s Roger Ebert using a text-to-speech synthesizer pre-programmed with his own voice:

And here, just for reference, is something no less intelligible and only a smidge more robotic, only it happens to be about 25 years old and running on a computer with about 1/62,000th the memory:

If this is the state of the art, is it any wonder that the Authors Guild no longer seems to care that the iPad, like the Kindle, can “read” a document aloud?

Granted, comparing Ebert’s Speech Generating Device (SGD) to Hawking’s reveals that we now have the ability to make a computer’s Robby-esque voice sound something like the person whose voice an SGD is meant to replace - a good first step in using these devices for people who lose their speech to degenerative diseases like ALS or, in Ebert’s case, to cancer.

SGDs that sound like an individual are possible because of what’s known as data-based speech synthesis or concatenative speech synthesis. This technique is used in concert with “voice banking,” in which a user who knows they will lose the power of speech records hours of it in advance.

Synthesized versus concatenated speech

Unlike truly synthesized speech, a herculean task requiring a programmer to generate a voice from scratch using only modifications of basic sounds, data-based speech synthesis draws on a library of hours of natural speech, playing back short sections of it in order to compose any word in the target language. It’s a bit like the difference between old-school music synthesizers and sampling.

Monophones, Diphones, Triphones…

Data-based speech synthesis has a number of problems. The first is that it composes speech from diphones - pairs of word sounds. This is fairly computationally intensive: every word the SGD speaks must be composed of multiple diphones which it must identify in its existing database.
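
As a rough illustration of the lookup described above, here is a minimal Python sketch, not the implementation of any real SGD. The names `to_diphones`, `synthesize` and `unit_db`, the "_" silence marker, and the toy sample values are all assumptions made for this example; a real voice bank would store recorded audio segments rather than short lists of numbers.

```python
def to_diphones(phones):
    """Split a sequence of speech sounds into overlapping pairs.

    "_" is a hypothetical marker for the silence before and after a word.
    """
    padded = ["_"] + list(phones) + ["_"]
    return list(zip(padded, padded[1:]))


def synthesize(phones, unit_db):
    """Concatenate the stored recording for each diphone in the word.

    unit_db maps a (sound, sound) pair to a list of audio samples.
    Raises KeyError if a needed pair was never recorded.
    """
    output = []
    for pair in to_diphones(phones):
        if pair not in unit_db:
            raise KeyError(f"no recording for diphone {pair}")
        output.extend(unit_db[pair])
    return output


if __name__ == "__main__":
    # Toy database: each "recording" is just a short list of numbers.
    toy_db = {
        ("_", "k"): [1],
        ("k", "ae"): [2, 3],
        ("ae", "t"): [4],
        ("t", "_"): [5],
    }
    print(to_diphones(["k", "ae", "t"]))
    print(synthesize(["k", "ae", "t"], toy_db))
```

Even in this toy form, every sound in a word costs one database lookup, and a single missing pair is enough to make the word unspeakable.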

This means thousands and thousands of diphones, and yet the words we speak are not merely concatenations of pairs of sounds; some words are collections of sounds unto themselves, and diphones common to two words might not sound right in a third, which could require a triphone or even something more. It’s easy to see how the number of possible combinations an SGD would have to choose from quickly becomes an intractable problem when moving beyond simple two-sound units.
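
The growth is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical inventory of 40 basic sounds, a round figure chosen only for illustration, and counts every ordered combination as an upper bound; the function name `unit_count` is likewise made up for this example.

```python
def unit_count(n_sounds, unit_length):
    """Upper bound on distinct units: every ordered sequence of unit_length sounds."""
    return n_sounds ** unit_length


if __name__ == "__main__":
    N_SOUNDS = 40  # assumed inventory size, for illustration only
    for length, name in [(1, "monophones"), (2, "diphones"), (3, "triphones")]:
        print(f"{name}: up to {unit_count(N_SOUNDS, length)}")
```

Under that assumption the ceiling rises from 40 single sounds to 1,600 pairs to 64,000 triples. Not every combination occurs in a real language, but each step up in unit length multiplies the recordings a voice bank would need.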

The monotone problem

Even the best commercially available concatenated speech systems make no attempt to conquer the problem of emphasis. In normal speech, we convey emotions through a range of tricks - pauses, the timing of syllables, tone. Even in the lab, the best attempts at putting emotions like anger and fear in synthesized speech successfully convey these feelings only about 60% of the time (pdf here), and the numbers are even worse for joy.

Like artificial intelligence, speech recognition, and computer vision, speech synthesis is another one of the functions humans perform easily that we have so far found incredibly difficult to reproduce in silico.
