Meet my alter ego, Katie:
The accent, emotion, and intonation are all mine. But somehow I now sound like a youngish woman with a high-pitched voice.
My feminine “voice skin” was created by Modulate.ai, a company based in Cambridge, Massachusetts. The firm uses machine learning to copy, model, and manipulate the properties of voice in a powerful new way.
The technology goes far beyond the simple voice filters that can let you sound like Kylo Ren. Using this approach, it is possible to assume any age, gender, or tone you’d like, all in real time. Or to take on the voice of a celebrity. I can hold a lengthy phone conversation in the guise of Katie if I wish.
I visited Modulate’s headquarters to hear about the company’s technology and ambitions, and to discuss the ethical implications of using AI to copy someone else’s voice. In a sound-isolated booth, I tried out a few of the company’s voice skins.
Here’s my actual voice:
And here it is being fed through another persona:
And here it is being switched between the two personas in real time:
The voice-modeling technology isn’t perfect; each new voice is a little warbly. But it’s remarkably good, and it improves by feeding on more of your voice data. And it shows how advances in machine learning are rapidly starting to alter digital reality. Modulate uses generative adversarial networks (GANs) to capture and model the audio properties of a voice signal. GANs pit two neural networks against each other in a battle to capture and reproduce the properties of a data set convincingly (see “The GANfather”).
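To make the "two networks in a battle" idea concrete, here is a minimal sketch of adversarial training on toy numbers rather than audio. It is not Modulate's system: the one-dimensional "data," the two-parameter generator and discriminator, and the function name `train_toy_gan` are all assumptions chosen for illustration. The discriminator learns to score real samples above fake ones, and the generator is then nudged in whatever direction raises the discriminator's score of its fakes.

```python
import math
import random


def sigmoid(t):
    # Logistic function, written to avoid overflow for large |t|.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=200, batch=64, lr=0.05, seed=0):
    """Hypothetical toy GAN: real data ~ N(4, 0.5), all values illustrative."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator:      g(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator:  D(x) = sigmoid(w * x + c)

    for _ in range(steps):
        real = [rng.gauss(4.0, 0.5) for _ in range(batch)]
        fake = [a * rng.gauss(0.0, 1.0) + b for _ in range(batch)]

        # Discriminator step: gradient ascent on
        # log D(real) + log(1 - D(fake)).
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += (1.0 - d) * x
            gc += 1.0 - d
        for x in fake:
            d = sigmoid(w * x + c)
            gw -= d * x
            gc -= d
        w += lr * gw / batch
        c += lr * gc / batch

        # Generator step: gradient ascent on log D(g(z)),
        # i.e. try to make fakes that the discriminator rates as real.
        ga = gb = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            d = sigmoid(w * (a * z + b) + c)
            ga += (1.0 - d) * w * z
            gb += (1.0 - d) * w
        a += lr * ga / batch
        b += lr * gb / batch

    return {"a": a, "b": b, "w": w, "c": c}


if __name__ == "__main__":
    print(train_toy_gan())
```

Even in this tiny setting the two players tend to chase each other rather than settle neatly, which hints at why training GANs on something as rich as a human voice is difficult.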
Machine learning has made it possible to swap two people’s faces in a video, using software that can be downloaded free from the internet (see “Fake America great again”). AI researchers are using GANs and other techniques to manipulate visual scenes and even conjure up completely fake faces.
Modulate has a demonstration voice skin of Barack Obama on its site, and cofounder and CEO Mike Pappas says it would be possible to generate one for anyone, given enough training data. But he adds that the company won’t make a celebrity voice skin available without the owner’s permission. He also insists that deception isn’t the main point.
“This isn’t technology built to imitate people,” Pappas says. “It’s built to give you new opportunities.”
Modulate is targeting online games such as Fortnite or Call of Duty, in which players can chat with strangers through a microphone. This can enhance the gameplay, but it can also open the door to abuse and harassment.
“When we want to interact online and have really deep experiences, voices are crucial,” says Pappas. “But some people aren’t willing to actually put their voice out there. In some cases, maybe I just want to stay anonymous. In other cases, I’m worried that I’m going to reveal my age or gender and get harassed.”
Charles Seife, a professor at NYU who studies the spread of misinformation, says the technology seems significantly more advanced than other voice-modification technology. And he says the way AI can now manipulate video and audio has the potential to fundamentally alter the media. “We have to start thinking about what constitutes reality,” he says.
“So far, the quality of voice conversion technology has been low so that one can easily distinguish a converted voice,” adds Tuomas Virtanen, an expert on voice synthesis and manipulation at Tampere University in Finland. “But I can imagine that in the near future the quality will be good enough so that conversion cannot be detected easily.”
Modulate is aware that its technology has the potential to be misused. The company says it will seek assurances that any customer copying someone’s voice has that person’s permission. It has also developed an audio watermarking technology that could be used to detect a copied voice. Such a watermark could trigger a warning if someone is using a fake voice on a call, for example.
“We’ve built ethical safeguards into our company from the ground up,” says cofounder and CTO Carter Huffman, “from how we distribute our technology, to how we select the voice skins to offer, to watermarking our audio for detection in sensitive systems.”
Modulate might be able to limit the misuse of its own technology, but it’s possible others will develop similar technology independently and make it available for people to misuse. The question is how widely such technology might be misused, and how savvy about it the public will become.
Pappas is optimistic, arguing that the potential for AI fakery is often overblown. “It’s definitely something where you want to be cognizant of it, but it’s not something where the very facets of society are crumbling down,” he says. “We have tools to handle this.”