A View from Christopher Mims
Why Synthesized Speech Sounds So Awful
Most “synthesized” speech is actually computationally intensive playback that can’t transcend a monotone.
Those of you who haven’t listened to synthesized speech since the last time you watched A Brief History of Time, prepare to be underwhelmed by the lack of progress. Here’s Roger Ebert using a text-to-speech synthesizer pre-programmed with his own voice:
And here, just for reference, is something no less intelligible and only a smidge more robotic, except that it happens to be about 25 years old and running on a computer with about 1/62,000th the memory:
If this is the state of the art, is it any wonder that the Authors Guild no longer seems to care that the iPad, like the Kindle, can “read” a document aloud?
Granted, comparing Ebert’s Speech Generating Device (SGD) to Hawking’s reveals that we can now make a computer’s Robby-esque voice sound something like the person whose voice the SGD is meant to replace, a good first step in making these devices useful to people with degenerative diseases like ALS or, in Ebert’s case, a voice lost to cancer.
SGDs that sound like an individual are possible because of what’s known as data-based speech synthesis or concatenative speech synthesis. This technique is used in concert with “voice banking,” in which a user who knows they will lose the power of speech records hours of it in advance.
Synthesized versus concatenated speech
Unlike truly synthesized speech, a herculean task requiring a programmer to generate a voice from scratch using only modifications of basic sounds, data-based speech synthesis draws on a library of hours of natural speech, playing back short sections of it in order to compose any word in the target language. It’s a bit like the difference between old-school music synthesizers and sampling.
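The sampling analogy can be made concrete with a toy sketch. This is an illustration, not any real SGD’s implementation: it assumes a hypothetical `voice_bank` that maps diphones (adjacent-sound pairs) to pre-recorded lists of audio samples, and simply glues the units together with a short crossfade at each join.

```python
# Toy sketch of concatenative (data-based) speech synthesis.
# Assumption: a "voice bank" of pre-recorded diphone units, here
# represented as plain lists of float samples rather than real audio.

def word_to_diphones(phonemes):
    """Split a phoneme sequence into overlapping adjacent pairs (diphones)."""
    return [(phonemes[i], phonemes[i + 1]) for i in range(len(phonemes) - 1)]

def synthesize(phonemes, voice_bank, crossfade=4):
    """Concatenate recorded diphone units, crossfading at each join."""
    output = []
    for diphone in word_to_diphones(phonemes):
        unit = voice_bank[diphone]  # a KeyError means a gap in the recordings
        if output and crossfade:
            # Blend the overlap region to soften the audible seam.
            tail, head = output[-crossfade:], unit[:crossfade]
            blended = [t * (1 - i / crossfade) + h * (i / crossfade)
                       for i, (t, h) in enumerate(zip(tail, head))]
            output = output[:-crossfade] + blended + unit[crossfade:]
        else:
            output += unit
    return output

# A two-diphone "word" built from a tiny hypothetical voice bank:
bank = {("h", "e"): [0.1] * 5, ("e", "l"): [0.2] * 5}
audio = synthesize(["h", "e", "l"], bank)
```

Even this sketch hints at the real problem: every word is a database lookup per diphone, and a missing or ill-fitting unit leaves an audible seam that no crossfade can fully hide.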
Monophones, Diphones, Triphones…
Data-based speech synthesis has a number of problems. The first is that it composes speech from diphones: units that span the transition between two adjacent speech sounds. This is fairly computationally intensive: every word the SGD speaks must be assembled from multiple diphones, each of which it must locate in its existing database.
This means thousands and thousands of diphones, and yet the words we speak are not merely concatenations of pairs of sounds: some words are collections of sounds unto themselves, and a diphone that sounds right in two words might not sound right in a third, which could require a triphone or something even longer. It’s easy to see how the number of possible combinations an SGD would have to choose from quickly becomes intractable when moving beyond simple two-sound units.
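The combinatorics behind that claim are easy to sketch. Assuming roughly 40 phonemes for English (the exact inventory varies by analysis), the number of candidate units grows exponentially with unit length:

```python
# Back-of-the-envelope unit counts, assuming ~40 English phonemes.
# The exact phoneme inventory is an assumption; counts scale the same way.
PHONEMES = 40

diphones = PHONEMES ** 2   # every ordered pair of adjacent sounds
triphones = PHONEMES ** 3  # every ordered triple

print(diphones)   # 1600 candidate two-sound units
print(triphones)  # 64000 candidate three-sound units
```

Each step up in unit length multiplies the space by another factor of the phoneme count, which is why systems that stop at diphones stay tractable but sound seamy, while richer units quickly outgrow any recorded voice bank.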
The monotone problem
Even the best commercially available concatenative speech systems make no attempt to conquer the problem of emphasis. In normal speech, we convey emotion through a range of tricks: pauses, the timing of syllables, tone. Even in the lab, the best attempts at putting emotions like anger and fear into synthesized speech convey those feelings successfully only about 60% of the time (pdf here), and the numbers are even worse for joy.
Like artificial intelligence, speech recognition, and computer vision, speech synthesis is another one of the functions humans perform easily that we have so far found incredibly difficult to reproduce in silico.