Why Synthesized Speech Sounds So Awful

Most “synthesized” speech is actually computationally intensive playback that can’t transcend a monotone.

We have tricorders, teleportation and dynamic touch-screen interfaces, but not the most mundane prediction of Star Trek and countless other sci-fi franchises: human-like synthesized speech.

Those of you who haven’t listened to synthesized speech since the last time you watched A Brief History of Time, prepare to be underwhelmed by the lack of progress. Here’s Roger Ebert using a text-to-speech synthesizer pre-programmed with his own voice:

And here, just for reference, is something no less intelligible and only a smidge more robotic, except that it happens to be about 25 years old and running on a computer with about 1/62,000th the memory:

If this is the state of the art, is it any wonder that the Authors Guild no longer seems to care that the iPad, like the Kindle, can “read” a document aloud?

Granted, comparing Ebert’s Speech Generating Device (SGD) to Hawking’s shows that we can now make a computer’s Robby-esque voice sound something like the person whose voice it is meant to replace. That’s a good first step in making these devices useful to people with degenerative diseases like ALS or, as in Ebert’s case, to those who have lost their voice to cancer.

SGDs that sound like an individual are possible because of what’s known as data-based, or concatenative, speech synthesis. The technique is used in concert with “voice banking,” in which a user who knows they will lose the power of speech records hours of it in advance.

Synthesized versus concatenated speech

Unlike truly synthesized speech, a herculean task requiring a programmer to generate a voice from scratch using only modifications of basic sounds, data-based speech synthesis draws on a library of hours of natural speech, playing back short sections of it in order to compose any word in the target language. It’s a bit like the difference between old-school music synthesizers and sampling.
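To make the sampling analogy concrete, here is a minimal sketch of the concatenative approach, assuming a unit library built from a speaker’s banked recordings. The names (unit_library, synthesize) and the placeholder waveforms are purely illustrative, not the API of any real SGD.

```python
# Minimal sketch of data-based (concatenative) synthesis: rather than
# generating sound from rules, the system plays back short recorded units.
# All names and placeholder data here are illustrative, not a real SGD API.
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate of the banked recordings

# Built by "voice banking": hours of the user's speech, cut into short
# units (here, diphones) and indexed by the pair of sounds they span.
unit_library: dict[str, np.ndarray] = {
    "h-eh": np.zeros(800),   # placeholders; real entries are snippets
    "eh-l": np.zeros(900),   # of natural speech, with pitch and duration
    "l-ow": np.zeros(850),   # metadata attached
}

def synthesize(diphones: list[str]) -> np.ndarray:
    """Compose a word by concatenating stored units in order."""
    return np.concatenate([unit_library[d] for d in diphones])

# "Hello", very crudely, from three banked diphones
waveform = synthesize(["h-eh", "eh-l", "l-ow"])
```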

Monophones, Diphones, Triphones…

Data-based speech synthesis has a number of problems. The first is that it composes speech from diphones - pairs of adjacent speech sounds. This is fairly computationally intensive: every word the SGD speaks must be assembled from multiple diphones, each of which it must locate in its database.

This means thousands and thousands of diphones, and yet the words we speak are not merely concatenations of pairs of sounds; some words behave as units of sound in themselves, and a diphone common to two words might not sound right in a third, which could require a triphone or something longer still. It’s easy to see how the number of possible combinations an SGD would have to choose from quickly becomes intractable once you move beyond simple two-sound units.
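The arithmetic behind that explosion is easy to sketch. Assuming a ballpark figure of roughly 40 phones for English (the exact inventory varies by analysis), the number of units a library would need to cover grows geometrically with context length:

```python
# Back-of-the-envelope count of the units an SGD's library would need.
# The figure of ~40 English phones is an assumption; real inventories vary.
phones = 40

diphones = phones ** 2    # every ordered pair of sounds
triphones = phones ** 3   # every ordered triple

print(f"diphones:  {diphones:,}")   # 1,600
print(f"triphones: {triphones:,}")  # 64,000
```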

The monotone problem

Even the best commercially available concatenative speech systems make no attempt to conquer the problem of emphasis. In normal speech, we convey emotions through a range of tricks - pauses, the timing of syllables, tone. Even in the lab, the best attempts at putting emotions like anger and fear into synthesized speech convey those feelings successfully only about 60% of the time (pdf here), and the numbers are even worse for joy.
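As a rough illustration of what “emphasis” would even mean to such a system, these are the kinds of controls it would have to vary word by word. The Prosody class and its values below are hypothetical, not drawn from any actual SGD.

```python
# Hypothetical sketch of the prosodic "knobs" a synthesizer would need to
# vary, word by word, to escape a monotone. Not taken from any real system.
from dataclasses import dataclass

@dataclass
class Prosody:
    pause_ms: int                 # silence inserted before the unit
    duration_scale: float         # stretch or compress syllable timing
    pitch_shift_semitones: float  # raise or lower the tone

# Concatenated playback effectively applies one flat setting everywhere...
monotone = Prosody(pause_ms=0, duration_scale=1.0, pitch_shift_semitones=0.0)

# ...while expressive speech varies all three from word to word.
emphatic = Prosody(pause_ms=120, duration_scale=1.3, pitch_shift_semitones=2.0)
```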

Like artificial intelligence, speech recognition, and computer vision, speech synthesis is another one of the functions humans perform easily that we have so far found incredibly difficult to reproduce in silico.

Follow Mims on Twitter or contact him via email.
