Podcast: AI finds its voice

Synthetic voice technologies are increasingly passing as human.

April 28, 2021

Today’s voice assistants are still a far cry from the hyper-intelligent thinking machines we’ve been musing about for decades. And it’s because that technology is actually the combination of three different skills: speech recognition, natural language processing, and voice generation.

Each of these skills already presents huge challenges. In order to master just the natural language processing part? You pretty much have to re-create human-level intelligence. Deep learning, the technology driving the current AI boom, can train machines to become masters at all sorts of tasks. But it can only learn one at a time. And because most AI models train their skill set on thousands or millions of existing examples, they end up replicating patterns within historical data—including the many bad decisions people have made, like marginalizing people of color and women.

Still, systems like the board-game champion AlphaZero and the increasingly convincing fake-text generator GPT-3 have stoked the flames of debate regarding when humans will create an artificial general intelligence—machines that can multitask, think, and reason for themselves. In this episode, we explore how machines learn to communicate—and what it means for the humans on the other end of the conversation.

We meet:

Susan C. Bennett, voice of Siri
Cade Metz, The New York Times
Charlotte Jee, MIT Technology Review

Credits

This episode was produced by Jennifer Strong, Emma Cillekens, Anthony Green, Karen Hao, and Charlotte Jee. We’re edited by Michael Reilly and Niall Firth.

Transcript

[TR ID]

Jim: I don't know if it was AI… If they had taken the recording of something he had done... and were able to manipulate it... but I'm telling you, it was my son.

Strong: The day started like any other for a man... we’re going to call Jim. He lives outside Boston.

And by the way... he has a family member who works for MIT.

We’re not going to use his last name because they have concerns about their safety.

Jim: It was a Tuesday or Wednesday morning, nine o'clock I'm deep in thought working on something,

Strong: That is... until he received this call.

Jim: The phone rings… and I pick it up and it's my son. And he is clearly agitated. This, this kid's a really chill guy but when he does get upset, he has a number of vocal mannerisms. And this was like, Oh my God, he's in trouble.

And he basically told me, look, I'm in jail, I'm in Mexico. They took my phone. I only have 30 seconds. Um, they said I was drinking, but I wasn't and people are hurt. And look, I have to get off the phone, call this lawyer and it gives me a phone number and has to hang up.

Strong: His son is in Mexico… and there’s just no doubt in his mind… it’s him.

Jim: And I gotta tell you, Jennifer, it, it was him. It was his voice. It was everything. Tone. Just these little mannerisms, the, the pauses, the gulping for air, everything that you could imagine.

Strong: His heart is in his throat...

Jim: My hair standing on edge

Strong: So, he calls that phone number… A man picks up… and he offers more details on what’s going on.

Jim: Your son is being charged with hitting this car. There was a pregnant woman driving whose arm was broken. Her daughter was in the back seat… is in critical condition and they are, um, they booked him with driving under the influence. We don't think that he has done that. This is we've, we've come across this a number of times before, but the most important thing is to get him out of jail, get him safe, as fast as possible.

Strong: Then the conversation turns to money… he’s told bail has been set… and he needs to put down ten percent.

Jim: So as soon as he started talking about money, you know, the, the flag kind of went up and I said, excuse me, is there any chance that this is a scam of some sort? And he got really kind of, um, irritated. He's like, “Hey, you called me. Look, I find this really offensive that you're accusing me of something.” And then my heart goes back in my throat. I'm like, this is the one guy who's between my son and even worse jail. So I backtracked…

[Music]

My wife walks in 10 minutes later and says, well, you know, I was texting with him late last night. Like this is around the time probably that he would have been arrested and jailed. So, of course we text him, he's just getting up. He's completely fine.

Strong: He’s still not sure how someone captured the essence of his son’s voice. But he has some theories...

Jim: They had to have gotten a recording of something when he was upset. That's the only thing that I can say, cause they couldn't have mocked up some of these things that he does. They couldn't guess at that. I don't think, and so they, I think they had certainly some raw material to work with and then what they did with it from there. I don't know.

Strong: And it’s not just Jim who's unsure… We have no idea whether AI had anything to do with this.

But, the point is… we now live in a world where we also can’t be sure that it didn’t.

It’s incredibly easy to fake someone’s voice with even a few minutes of recordings… and teenagers like Jim’s son? They share countless recordings through social media posts and messages.

Jim: I was quite impressed with how good it was. Um, like I said, I'm not easily fooled and man, they had it nailed. So, um, just caution.

Strong: I’m Jennifer Strong and this episode we look at what it takes to make a voice.

[SHOW ID]

Zeyu Jin: You guys have been making weird stuff online.

Strong: Zeyu Jin is a research scientist at Adobe… This is him speaking at a company conference about five years ago… showing how software can rearrange the words in this recording.

Key: I jumped on the bed and I kissed my dogs and my wife—in that order.

Zeyu: So how about we mess with who he actually kissed. // Introducing Project VoCo. Project VoCo allows you to edit speech in text. So let’s bring it up. So I just load this audio piece in VoCo. So as you can see we have the audio waveform and we have the text under it. //

So what do we do? Copy paste. Oh! Yeah it’s done. Let’s listen to it.

Key: And I kissed my wife and my dogs.

Zeyu: Wait there’s more. We can actually type something that’s not here.

Key: And I kissed Jordan and my dogs.

Strong: Adobe never released this prototype… but the underlying technology keeps getting better.

For example, here’s a computer-generated fake of podcaster Joe Rogan from 2019... It was produced by Square’s AI lab called Dessa to raise awareness about the technology.

Rogan: “Friends, I've got something new to tell all of you. I’ve decided to sponsor a hockey team made up entirely of chimps.”

Strong: While it sounds like fun and games… experts warn these artificial voices could make some types of scams a whole lot more common. Things like what we heard about earlier.

Mona Sedky: Communication focused crime has historically been lower on the totem pole.

Strong: That’s federal Prosecutor Mona Sedky speaking last year at the Federal Trade Commission about voice cloning technologies.

Mona Sedky: But now with the advent of things like deep fake video… now deep fake audio you… you can basically have anonymizing tools and be anywhere on the internet you want to be…. anywhere in the world… and communicate anonymously with people. So as a result there has been an enormous uptick in communication focused crime.

Balasubramaniyan: But imagine if you as a CFO or chief controller gets a phone call that comes from your CEO’s phone number.

Strong: And this is Pindrop Security CEO Vijay Balasubramaniyan at a security conference last year.

Balasubramaniyan: It’s completely spoofed… so it actually uses your address book, and it shows up as your CEO’s name... and then on the other end you hear your CEO’s voice with a tremendous amount of urgency. And we are starting to see crazy attacks like that. There was an example that a lot of press media covered, which is a $220,000 wire that happened because a CEO of a UK firm thought he was talking to his parent company… so he then sent that money out. But we’ve seen as high as $17 million go out the door.

Strong: And the very idea of fake voices... can be just as damaging as a fake voice itself… Like when former president Donald Trump tried to blame the technology for some offensive things he said that were caught on tape.

But like any other tech… it’s not inherently good or bad… it’s just a tool... and I used it in the trailer for season one to show what the technology can do.

Strong: If “seeing is believing”...

How do we navigate a world where we can’t trust our eyes... or ears?

And so you know... what you’re listening to... It’s not just me speaking. I had some help from an artificial version of my voice… filling in words here and there.

Meet synthetic Jennifer.

Synthetic Jennifer: “Hi there, folks!”

Strong: I can even click to adjust my mood…

Synthetic Jennifer: “Hi there.”

Strong: Yeah, let’s not make it angry.

Strong: In the not so distant future this tech will be used in any number of ways… for simple tweaks to pre-recorded presentations… even... to bring back the voices of animated characters from a series…

In other words, artificial voices are here to stay. But they haven’t always been so easy to make… and I called up an expert whose voice might sound familiar…

Bennett: How does this sound? Um, maybe I could be a little more friendly. How are you?

Hi, I'm Susan C. Bennett, the original voice of Siri.

Well, the day that Siri appeared, which was October 4, 2011, a fellow voice actor emailed me and said, ‘Hey, we're playing around with this new iPhone app, isn't this you?’ And I said, what? I went on the Apple site and listened... and yep. That was my voice. [chuckles]

Strong: You heard that right. The original female voice that millions associate with Apple devices…? Had no idea. And she wasn’t alone. The human voices behind other early voice assistants were also taken by surprise.

Bennett: Yeah, it's been an interesting thing. It was an adjustment at first as you can imagine, because I wasn't expecting it. It was a little creepy at first, I'll have to say, I never really did a lot of talking to myself as Siri, but gradually I got accepting of it and actually it ended up turning into something really positive so…

Strong: To be clear, Apple did not steal Susan Bennett’s voice. For decades, she’s done voice work for companies like McDonald’s and Delta Air Lines… and years before Siri came out… she did a strange series of recordings that fueled its development.

Bennett: In 2005, we couldn't have imagined something like Siri or Alexa. And so all of us, I've talked to other people who've had the same experience, who have been a virtual voice. You know we just thought we were doing just generic phone voice messaging. And so when suddenly Siri appeared in 2011, it's like, I'm who, what, what is this? So, it was a genuine surprise, but I like to think of it as we were just on the cutting edge of this new technology. So, you know, I choose to think of it as a very positive thing, even though, we, none of us, were ever paid for the millions and millions of phones that our voices are heard on. So that's, that's a downside.

Strong: Something else that’s awkward... she says Apple never acknowledged her as the American voice of Siri… that’s despite becoming an accidental celebrity... reaching millions.

Bennett: The only actual acknowledgement that I've ever had is via Siri. If you ask Siri "Who is Susan Bennett?" she'll say, I'm the original voice of Siri. Thanks so much, Siri. Appreciate it.

Strong: But it’s not the first time she’s given her voice to a machine.

Bennett: In the late 70s when they were introducing ATMs I like to say it was my first experience as a machine, and you know, there were no personal computers or anything at that time and people didn't trust machines. They wouldn't use the ATMs because they didn't trust the machines to give them the right money. They, you know, if they put money in the machine they were afraid they'd never see it again. And so a very enterprising advertising agency in Atlanta at the time called McDonald and Little decided to humanize the machine. So they wrote a jingle and I became the voice of Tilly the all-time teller and then they ultimately put a little face on the machine.

Strong: The human voice helps companies build trust with consumers...

Bennett: There are so many different emotions and meanings that we get across through the sound of our voices rather than just in print. That's why I think emojis came up because you can't get the nuances in there without the voice. And so I think that's why voice has become such an important part of technology.

Strong: And in her own experience, interactions with this synthetic version of her voice have led people to trust and confide in her… to call her a friend, even though they’ve never met her.

Bennett: Well, I think the oddest thing about being the voice of Siri, to me, is when I first revealed myself, it was astounding to me how many people considered Siri their friend or some sort of entity that they could really relate to. I think they actually in many cases think of her as human.

Strong: It’s estimated the global market for voice technologies will reach nearly 185 billion dollars this year... and AI-generated voices? They’re a game changer.

Bennett: You know, after years and years of working on these voices, it's really, really hard to get the actual rhythm of the human voice. And I'm sure they'll probably do it at some point, but you will notice even to this day, you know, you'll listen to Siri or Alexa or one of the others and they'll be talking along and it sounds good until it doesn't. Like, Oh, I'm going to the store. You know, there's some weirdness in the rhythmic sense of it.

Strong: But even once human-like voices become commonplace... she’s not entirely sure that will be a good thing.

Bennett: But you know, the advantage for them is they don't really have to get along with Siri. They can just tell Siri what to do if they don't like what she says, they can just turn it off. So it is not like real human relations. It's like maybe what people would like human relations to be. Everybody does what I want. (laughter) Then everybody's happy. Right?

Strong: Of course, voice assistants like Siri and Alexa aren’t just voices. Their capabilities come from the AI behind the scenes too.

It’s been explored in science fiction films like this one, called Her… about a man who falls in love with his voice assistant.

Theodore: How do you work?

Samantha (AI): Well... Basically I have intuition. I mean... The DNA of who I am is based on the millions of personalities of all the programmers who wrote me, but what makes me me is my ability to grow through my experiences. So basically in every moment I'm evolving, just like you.

Strong: But today’s voice assistants are a far cry from the hyper-intelligent thinking machines we’ve been musing about for decades.

And it’s because that technology... is actually many technologies. It’s the combination of three different skills...speech recognition, natural language processing and voice generation.

Speech recognition is what allows Siri to recognize the sounds you make and transcribe them into words. Natural language processing turns those words into meaning... and figures out what to say in response. And voice generation is the final piece... the human element... that gives Siri the ability to speak.
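The three stages described here can be pictured as a simple chain, each one feeding the next. What follows is a minimal sketch and not how Siri is actually built: the class name `VoiceAssistant` and the three toy functions are hypothetical, and "audio" is faked as plain UTF-8 bytes so the example can run without any speech libraries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAssistant:
    """Hypothetical three-stage assistant: each stage is a swappable function."""
    recognize: Callable[[bytes], str]    # speech recognition: audio -> words
    understand: Callable[[str], str]     # language processing: words -> reply text
    synthesize: Callable[[str], bytes]   # voice generation: reply text -> audio

    def respond(self, audio: bytes) -> bytes:
        words = self.recognize(audio)
        reply = self.understand(words)
        return self.synthesize(reply)


# Toy stand-ins (assumptions for illustration only).
def toy_recognize(audio: bytes) -> str:
    # Pretend the audio is already text encoded as bytes.
    return audio.decode("utf-8").lower().strip()


def toy_understand(words: str) -> str:
    # A single hand-written rule; real systems need far more than this.
    if "hello" in words:
        return "Hi there."
    return "Sorry, I didn't catch that."


def toy_synthesize(reply: str) -> bytes:
    # Pretend that encoding the text is the same as speaking it.
    return reply.encode("utf-8")


assistant = VoiceAssistant(toy_recognize, toy_understand, toy_synthesize)
print(assistant.respond(b"Hello Siri"))
```

Keeping the stages separate mirrors the point made in the episode: each one is its own hard problem, and the middle stage is where a hand-written rule like the one above quickly runs out.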

Each of these skills is already a huge challenge... In order to master just the natural language processing part? You pretty much have to re-create human-level intelligence.

And we’re nowhere near that. But we’ve seen remarkable progress with the rise of deep learning… helping Siri and Alexa be a little more useful.

Metz: What people may not know about Siri is that original technology was something different.

Strong: Cade Metz is a tech reporter for the New York Times. His new book is called Genius Makers: The Mavericks Who Brought AI to Google, Facebook, and the World.

Metz: The way that Siri was originally built... You had to have a team of engineers, in a room, at their computers and piece by piece, they had to define with computer code how it would recognize your voice.

Strong: Back then... engineers would spend days writing detailed rules meant to show machines how to recognize words and what they mean.

And this was done at the most basic level… often working with just snippets of voice at a time.

Just imagine all the different ways people can say the word “hello” … or all the ways we piece together sentences … explaining why “time flies” or how some verbs can also be nouns.

Metz: You can never piece together everything you need, no matter how many engineers you have no matter how rich your company is. Defining every little thing that might happen when someone speaks into their iPhone… You just don't have enough person-power to build everything you need to build. It's just too complicated.

Strong: Neural networks made that process a whole lot easier… They simply learn by recognizing patterns in data fed into the system.

Metz: You take that human speech… You give it to the neural network… And the neural network learns the patterns that define human speech. That way it can recreate it without engineers having to define every little piece of it. The neural network literally learns the task on its own. And that's the key change... is that a neural network can learn to recognize what a cat looks like, as opposed to people having to define for the machine what a cat looks like.

Strong: But even before neural networks… Tech companies like Microsoft aimed to build systems that could understand the everyday way people write and talk.

And in 1996, Microsoft hired a linguist … Chris Brockett... to begin work on what they called natural-language AI.

Metz: The guy's not a computer scientist, but what his job was was to define the way that language is pieced together, right. For a computer. And that is just an incredibly difficult task, right? Why do we as English speakers order our words, the way we do, right? And he, he spent years, literally years, five or six years at Microsoft, you know, slowly, you know, trying to tell the computer the way that English is, is put together. So then the computer can do that.

Strong: Then, one afternoon in 2003… a small group at Microsoft... down the hall from Brockett... started work on a new project. They were building a system that translated languages using a technique based on statistics.

The idea being if a set of words in one language appeared with the same frequency and context in another, that was the likely translation.
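To make the statistical idea concrete, here is a toy sketch that guesses word translations purely from how often words appear together in paired sentences. It is not the Microsoft prototype: the four-sentence corpus, the function names `train` and `best_translation`, and the choice of the Dice coefficient as the scoring rule are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Tiny hypothetical parallel corpus: (English, French) sentence pairs.
PAIRS = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
    ("a dog", "un chien"),
]


def train(pairs):
    """Count how often each word appears, and how often word pairs co-occur."""
    src_count, tgt_count, co_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word in the same pair.
        co_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, co_count


def best_translation(word, model):
    """Pick the target word whose frequency and context best match `word`.

    Scoring uses the Dice coefficient: 2 * co-occurrences / (count_a + count_b).
    """
    src_count, tgt_count, co_count = model
    scores = {
        t: 2 * co_count[(word, t)] / (src_count[word] + tgt_count[t])
        for t in tgt_count
        if co_count[(word, t)]
    }
    return max(scores, key=scores.get) if scores else None


model = train(PAIRS)
for w in ["cat", "dog", "the", "a"]:
    print(w, "->", best_translation(w, model))
```

No rule about grammar appears anywhere in the code. The pairings fall out of the counts alone, which is the contrast with the hand-built linguistic approach described above.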

Metz: They put together a prototype in a matter of weeks and showed it off to a group at the Microsoft research center, including Chris Brockett.

Strong: The system was... pretty cobbled together. It only worked when applied to pieces of a sentence… And even then... the translations were jumbled.

Metz: As he sees them demonstrate this... he has a panic attack to the point where he literally thinks he's having a heart attack because he realizes that his career might be over. That everything he has spent the past six years on // is pointless and has been made pointless by the system that these guys built in a matter of weeks.

Strong: At that time we didn’t have the amount of data needed to train a neural network, nor the processing power… but the idea of one has been around since the 1980s.

And one of those ideas came in the form of NetTalk... which was developed by AI pioneer Terry Sejnowski.

The system could learn to speak words on its own by studying children’s books.

Metz: Terry had this incredible demo that he would show to people at conferences. It was sort of time-lapsed because it took a while for the neural network to learn, but he could show that as it started to analyze the patterns in these children's books, they could start to babble…

[Sounds from NetTalk Demo]

Metz: and then it could babble a little better, and then it could start to piece words together, and then suddenly it could pronounce these words.

[Sounds from NetTalk Demo]

Metz: He could show his audience // with this demo, how a neural network could learn.

Strong: It would be another two decades before the computing power existed to really make this useful…

Metz: So natural language was an area where even after the success of neural networks with speech and image, people thought, Oh, well, it's not going to work with natural language. Well, it has. That doesn't mean it's perfect.

Strong: Deep learning, the technology driving the current AI boom, can train machines to become masters at all sorts of tasks. But it can only learn things one at a time. And because most AI models train their skill set on thousands or millions of examples, they end up repeating patterns found in old data, including the many bad decisions that people have made, like marginalizing people of color and women.

And any big advances stir up this debate about when humans will create an artificial general intelligence—or machines that can multitask, think, and reason for themselves. Recently, that’s been advances like the board-game champion AlphaZero... and the increasingly convincing fake-text generator GPT-3...

Metz: It can, it can generate blog posts. It can generate tweets, emails. It can generate computer programs. You know, it works maybe half the time, but when it does work, you cannot tell the difference between its English and your English. Okay. That is progress. It's not the brain, it's not even close, but it's progress.

Strong: And these and other tools are also... incredibly divisive.

Metz: Can we, in the near future, build a system that can do anything the human brain can do. Right. And people will argue about this, like foaming at the mouth on either side. The reality is we have no idea. Like there are people who are completely sure this is going to happen pretty soon, but they don't know what the path is there. None of us can predict the future. And so it's an argument about nothing that can be fundamentally decided. So of course the argument never ends. You go back to the 50s and it's, it's all the same stuff, right?

Strong: But if we are to someday replicate that intelligence… might we also be able to replicate ourselves?

…That’s after the break.

[Midroll]

[Music transition]

Strong: Artificial voices have been around for a while...but they didn’t start getting more human-like until really the last five years.

Like when DeepMind’s text-to-speech algorithm called WaveNet came onto the scene… which is the basis of Google’s assistant Duplex... the one that can book your hair appointment or restaurant reservation.

[Sounds of Google Duplex scheduling a hair appointment on user’s behalf]

*Phone ringing*

Stylist: “Hello. How can I help you?”

AI: “Hi. I’m calling to book a women’s haircut for a client. Umm... I’m looking for something on May 3.”

Stylist: “Sure. Give me one second.”

Strong: Since then, many other companies have worked to commercialize similar voice technologies. Lyrebird... Descript... Respeecher... just to name a few.

There’s also a slew of companies geared toward preserving the memory of our loved ones... by creating interactive, digital versions of them.

One company, called HereAfter, turns them into chatbots that can be used with voice assistants.

And Tech Review reporter Charlotte Jee tried it out with her parents...

Jee: Alexa, open HereAfter.

Synthetic voice: Welcome to HereAfter. Would you rather speak with Paul or with Jane?

Jee: I'd rather speak with Jane.

Virtual Jane: Hello, this is Jane Jee, and I'm happy to tell you about my life.

Jee: HAHA OH MY GOD!

Virtual Jane: How are you today?

Jee: Uh, I'm well, thanks, Mum. How are you?

Virtual Jane: Good.

Strong: The system’s listening skills... aren’t the best. You can only ask short, simple questions... and you have to wait until the system has finished speaking before you can ask a new question.

Virtual Jane: There's so much to talk about. My childhood, career, and my interests. Which of those sounds best?

Jee: Uh… your childhood.

Virtual Jane: Sure. I got into trouble as a child because I was very independent and I liked to exercise my freedom…

Strong: But what you’re hearing now... isn’t actually a fake or synthetic version of Charlotte’s mother. In fact, this experience relies on far less AI than you’d think.

Her parents were interviewed for hours by HereAfter with questions supplied by Charlotte and her sister. That interview was then edited and broken into topic sections... which can be brought up and played by the system based on the questions they ask.
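A rough sketch of that kind of lookup is below. It only illustrates the general idea of matching a spoken question to a pre-recorded clip, and it is not HereAfter's actual implementation: the topic names come from the exchange above, while the file names and the function `pick_segment` are hypothetical.

```python
# Hypothetical mapping from interview topic to a pre-recorded audio clip.
SEGMENTS = {
    "childhood": "jane_childhood.mp3",
    "career": "jane_career.mp3",
    "interests": "jane_interests.mp3",
}


def pick_segment(question, segments=SEGMENTS):
    """Return the clip whose topic word appears in the question, else None."""
    # Normalize each word: strip surrounding punctuation and lowercase it.
    words = {w.strip(".,?!…").lower() for w in question.split()}
    for topic, clip in segments.items():
        if topic in words:
            return clip
    return None


print(pick_segment("Uh… your childhood."))
```

Nothing here generates speech. The voice the listener hears is the original recording, selected by keyword, which is why short and simple questions work best.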

But... as we’ve seen... voice is powerful. Especially when it’s presented as an interactive experience.

Jee: Oh my God. (laughter) That was so weird!

That was like hearing my mom... as a machine. That was really freaky.

I felt more emotional listening to that than I kind of expected to? When, like, the voice relaxed and it sounded like her.

Strong: This feels a lot like something we’ve seen before. Like in an episode of Black Mirror… where a woman uses her partner’s smartphone data to create a synthetic version of his voice after he dies.

[Sounds from Black Mirror - AI sifting through shared media, montage of audio clips from the woman’s deceased partner]

Strong: It sifts through old videos, texts, voicemails, and social media posts to build a system capable of mimicking his voice... and personality.

AI: “Hello?”

Woman: “...Hello! You... sound just like him.”

AI: “Almost creepy, isn’t it? I say creepy…. I mean, it’s totally batshit crazy I can even talk to you. I mean...I don’t even have a mouth.”

Woman: “That's...That’s just…”

AI: “That’s what?”

Woman: “That’s just the sort of thing he would say.”

AI: “Well... that’s why I said it.”

Strong: Which brings up a thorny issue... is she building trust with her AI partner... or is it just telling her what she wants to hear... ?

And beyond how we might develop voice technologies capable of common sense or self-improvement... lies yet another question we’re just starting to raise… which is... how do we reckon with this newfound power... to synthesize something as personal as someone’s voice?

[CREDITS]

Strong: Next episode… We look at the role of automation in our credit.

Michele Gilman: The witness for the state who was a nurse, couldn't explain anything about the algorithm. She just kept repeating over and over that it was internationally and statistically validated, but she couldn't tell us how it worked, what data was fed into it, what factors it weighed, how the factors were weighed. And so my student attorney looks at me and we're looking at each other thinking, how do we cross examine an algorithm…

Strong: This episode was made by me, Emma Cillekens, Anthony Green, Karen Hao, and Charlotte Jee. We’re edited by Michael Reilly and Niall Firth.

Thanks for listening, I’m Jennifer Strong.

[TR ID]
