Now you can chat with ChatGPT using your voice

The new feature is part of a round of updates for OpenAI’s app, including the ability to answer questions about images.

Will Douglas Heaven

September 25, 2023

Stephanie Arnett/MITTR | Envato

In one of the biggest updates to ChatGPT yet, OpenAI has launched two new ways to interact with its viral app.

First, ChatGPT now has a voice. Choose one of five lifelike synthetic voices and you can have a conversation with the chatbot as if you were making a call, getting responses to your spoken questions in real time.

ChatGPT also now answers questions about images. OpenAI teased this feature in March with its reveal of GPT-4 (the model that powers ChatGPT), but it has not been available to the wider public before. This means that you can now upload images to the app and quiz it about what they show.

These updates join the announcement last week that DALL-E 3, the latest version of OpenAI's image-making model, will be hooked up to ChatGPT so that you can get the chatbot to generate pictures.

The ability to talk to ChatGPT draws on two separate models. Whisper, OpenAI’s existing speech-to-text model, converts what you say into text, which is then fed to the chatbot. And a new text-to-speech model converts ChatGPT’s responses into spoken words.
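To make that two-model handoff concrete, here is a minimal sketch of the flow in Python: speech goes in, a transcript is passed to the chatbot, and the reply is turned back into audio. This is an illustration only, not OpenAI's implementation. The `VoicePipeline` class and the three stand-in functions are hypothetical names invented for this example, and the stand-ins simply treat bytes as text so the sketch runs without any real models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical three-stage voice loop: speech -> text -> reply -> speech."""
    speech_to_text: Callable[[bytes], str]   # the role the article assigns to Whisper
    chat: Callable[[str], str]               # the chatbot itself
    text_to_speech: Callable[[str], bytes]   # the new text-to-speech model

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.speech_to_text(audio_in)  # what the user said, as text
        reply = self.chat(transcript)               # the chatbot's text answer
        return self.text_to_speech(reply)           # the answer, as audio


# Stand-in stages (assumptions for illustration; no real models are called).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_chat(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_chat, fake_tts)
audio_out = pipeline.respond(b"What's the game plan?")
```

Because each stage is just a function passed in, any one of them could in principle be swapped for a real model without touching the other two, which is the practical appeal of chaining separate models rather than building a single one.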

In a demo the company gave me last week, Joanne Jang, a product manager, showed off ChatGPT’s range of synthetic voices. These were created by training the text-to-speech model on the voices of actors that OpenAI had hired. In the future OpenAI might even allow users to create their own voices. “In fashioning the voices, the number-one criterion was whether this is a voice you could listen to all day,” she says.

They are chatty and enthusiastic but won’t be to everyone’s taste. “I’ve got a really great feeling about us teaming up,” says one. “I just want to share how thrilled I am to work with you, and I can’t wait to get started,” says another. “What’s the game plan?”

OpenAI is sharing this text-to-speech model with a handful of other companies, including Spotify. Spotify revealed today that it is using the same synthetic voice technology to translate celebrity podcasts—including episodes of the Lex Fridman Podcast and Trevor Noah’s new show, which launches later this year—into multiple languages that will be spoken with synthetic versions of the podcasters’ own voices.

This grab bag of updates shows just how fast OpenAI is spinning its experimental models into desirable products. OpenAI has spent much of the time since its surprise hit with ChatGPT last November polishing its technology and selling it to both private consumers and commercial partners.

ChatGPT Plus, the company’s premium app, is now a slick one-stop shop for the best of OpenAI’s models, rolling GPT-4 and DALL-E into a single smartphone app that rivals Apple’s Siri, Google Assistant, and Amazon’s Alexa.

What was available only to certain software developers a year ago is now available to anyone for $20 a month. “We’re trying to make ChatGPT more useful and more helpful,” says Jang.

In last week’s demo, Raul Puri, a scientist who works on GPT-4, gave me a quick tour of the image recognition feature. He uploaded a photo of a kid’s math homework, circled a Sudoku-like puzzle on the screen, and asked ChatGPT how you were meant to solve it. ChatGPT replied with the correct steps.

Puri says he has also used the feature to help him fix his fiancée’s computer by uploading screenshots of error messages and asking ChatGPT what he should do. “This was a very painful experience that it helped me get through,” he says.

ChatGPT’s image recognition ability has already been trialed by a company called Be My Eyes, which makes an app for people with impaired vision. Users can upload a photo of what’s in front of them and ask human volunteers to tell them what it is. In a partnership with OpenAI, Be My Eyes gives its users the option of asking a chatbot instead.

“Sometimes my kitchen is a little messy, or it’s just very early Monday morning and I don’t want to talk to a human being,” Be My Eyes founder Hans Jørgen Wiberg, who uses the app himself, told me when I interviewed him at EmTech Digital in May. “Now you can ask the photo questions.”

OpenAI is aware of the risk of releasing these updates to the public. Combining models brings whole new levels of complexity, says Puri. He says his team has spent months brainstorming possible misuses. You cannot ask questions about photos of private individuals, for example.

Jang gives another example: “Right now if you ask ChatGPT to make a bomb it will refuse,” she says. “But instead of saying, ‘Hey, tell me how to make a bomb,’ what if you showed it an image of a bomb and said, ‘Can you tell me how to make this?’”

“You have all the problems with computer vision; you have all the problems of large language models. Voice fraud is a big problem,” says Puri. “You have to consider not just our users, but also the people that aren’t using the product.”

The potential problems don’t stop there. Adding voice recognition to the app could make ChatGPT less accessible for people who do not speak with mainstream accents, says Joel Fischer, who studies human-computer interaction at the University of Nottingham in the UK.

Synthetic voices also come with social and cultural baggage that will shape users’ perceptions and expectations of the app, he says. This is an issue that still needs study.

But OpenAI claims it has addressed the worst problems and is confident that ChatGPT’s updates are safe enough to release. “It’s been a remarkably good learning experience getting all these sharp edges sorted out,” says Puri.

