
Why Synthesized Speech Sounds So Awful

Most “synthesized” speech is actually computationally intensive playback that can’t transcend a monotone.

We have tricorders and dynamic touch-screen interfaces, but not the most mundane prediction of Star Trek and countless other sci-fi franchises: human-like synthesized speech.

Those of you who haven’t listened to synthesized speech since the last time you watched A Brief History of Time, prepare to be underwhelmed by the lack of progress. Here’s Roger Ebert using a text-to-speech synthesizer pre-programmed with his own voice:

And here, just for reference, is something no less intelligible and only a smidge more robotic, only it happens to be about 25 years old and running on a computer with about 1/62,000th the memory:

If this is the state of the art, is it any wonder that the Authors Guild no longer seems to care that the iPad, like the Kindle, can “read” a document aloud?

Granted, comparing Ebert’s Speech Generating Device (SGD) to Hawking’s reveals that we can now make a computer’s Robby-esque voice sound something like the person whose voice the SGD is meant to replace - a good first step in making these devices useful to people with degenerative diseases like ALS or, in Ebert’s case, a voice lost to cancer.

SGDs that sound like an individual are possible because of what’s known as data-based, or concatenative, speech synthesis. This technique is used in concert with “voice banking,” in which a user who knows they will lose the power of speech records hours of it in advance.

Synthesized versus concatenated speech

Unlike truly synthesized speech, a herculean task requiring a programmer to generate a voice from scratch using only modifications of basic sounds, data-based speech synthesis draws on a library of hours of natural speech, playing back short sections of it in order to compose any word in the target language. It’s a bit like the difference between old-school music synthesizers and sampling.
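The concatenative approach can be sketched in a few lines. Assuming a hypothetical “voice bank” of pre-recorded diphone chunks (all names and sample values below are invented for illustration - real banks hold hours of audio, not lists of numbers), synthesis is essentially dictionary lookup plus splicing:

```python
# Toy sketch of data-based (concatenative) synthesis. The "voice bank"
# maps diphones to short pre-recorded chunks of audio; here plain lists
# of floats stand in for waveform samples. Everything is hypothetical.

VOICE_BANK = {
    "h-e": [0.1, 0.3, 0.2],
    "e-l": [0.2, 0.4],
    "l-o": [0.3, 0.1, 0.0],
}

def synthesize(diphones):
    """Splice the stored chunk for each diphone, in order."""
    audio = []
    for d in diphones:
        if d not in VOICE_BANK:
            raise KeyError(f"diphone {d!r} missing from voice bank")
        audio.extend(VOICE_BANK[d])
    return audio

# A "hello"-like word expressed as a diphone sequence:
print(synthesize(["h-e", "e-l", "l-o"]))
# -> [0.1, 0.3, 0.2, 0.2, 0.4, 0.3, 0.1, 0.0]
```

A real system would also smooth the joins so the seams between chunks aren’t audible; this sketch skips that entirely, which is part of why naive concatenation sounds choppy.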

Monophones, Diphones, Triphones…

Data-based speech synthesis has a number of problems. The first is that it composes speech from diphones - pairs of adjacent speech sounds. This is fairly computationally intensive: every word the SGD speaks must be assembled from multiple diphones, each of which it must locate in its existing database.

This means thousands and thousands of diphones, and yet the words we speak are not merely concatenations of pairs of sounds. Some words are collections of sounds unto themselves, and diphones common to two words might not sound right in a third, which could require a triphone or even something longer. It’s easy to see how the number of possible combinations an SGD would have to choose from quickly becomes intractable once you move beyond simple two-sound units.
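A back-of-the-envelope count shows how fast the inventory grows. English has roughly 44 phonemes (the exact figure depends on dialect and analysis), so the space of possible units expands exponentially with unit length:

```python
# Rough count of possible n-phone units for an English-sized phoneme
# inventory. 44 is an approximate figure; real diphone/triphone sets
# are smaller because many sequences never occur in the language.
PHONEMES = 44

for n, name in [(1, "monophones"), (2, "diphones"), (3, "triphones")]:
    print(f"{name}: {PHONEMES ** n:,}")
# monophones: 44
# diphones: 1,936
# triphones: 85,184
```

In practice phonotactics rule out many of these sequences, but the trend is the point: each step up in unit length multiplies the candidate pool by the whole phoneme inventory.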

The monotone problem

Even the best commercially available concatenated speech systems do not attempt to conquer the problem of emphasis. In normal speech, we convey emotions through a range of tricks - pauses, the timing of syllables, tone. Even in the lab, the best attempts at putting emotions like anger and fear into synthesized speech successfully convey those feelings only about 60% of the time (pdf here), and the numbers are even worse for joy.

Like artificial intelligence, speech recognition, and computer vision, speech synthesis is another one of the functions humans perform easily that we have so far found incredibly difficult to reproduce in silico.

Follow Mims on Twitter or contact him via email.
