A way to let robots learn by listening will make them more useful

For robots to move beyond warehouses and into homes, they’ll need to navigate using more than just vision.

Image: Screenshot from a fisheye video of a person pouring water from a mug into a plastic bottle held by a plastic hand, with an audio waveform inset at the bottom of the frame. Credit: Zeyi Liu et al.

Most AI-powered robots today use cameras to understand their surroundings and learn new tasks, but it’s becoming easier to train robots with sound too, helping them adapt to tasks and environments where visibility is limited. 

Though sight is important, some daily tasks depend more on sound, like listening to onions sizzling on the stove to tell whether the pan is at the right temperature. So far, however, robots have been trained with audio only in highly controlled lab settings, and the techniques have lagged behind other fast robot-teaching methods.

Researchers at the Robotics and Embodied AI Lab at Stanford University set out to change that. They first built a system for collecting audio data, consisting of a GoPro camera and a gripper with a microphone designed to filter out background noise. Human demonstrators used the gripper to perform a variety of household tasks, and the researchers then used the recorded data to teach robotic arms how to execute the tasks on their own. The team’s new training algorithms help robots gather clues from audio signals to perform more effectively.

“Thus far, robots have been training on videos that are muted,” says Zeyi Liu, a PhD student at Stanford and lead author of the study. “But there is so much helpful data in audio.”

To test how much more successful a robot can be if it’s capable of “listening,” the researchers chose four tasks: flipping a bagel in a pan, erasing a whiteboard, putting two Velcro strips together, and pouring dice out of a cup. In each task, sounds provide clues that cameras or tactile sensors struggle to pick up, like whether the eraser is properly contacting the whiteboard or whether the cup contains dice.

After demonstrating each task a couple of hundred times, the team compared the success rates of training with audio and training only with vision. The results, published in a paper on arXiv that has not been peer-reviewed, were promising. When using vision alone in the dice test, the robot could tell 27% of the time if there were dice in the cup, but that rose to 94% when sound was included.

It isn’t the first time audio has been used to train robots, says Shuran Song, the head of the lab that produced the study, but it’s a big step toward doing so at scale: “We are making it easier to use audio collected ‘in the wild,’ rather than being restricted to collecting it in the lab, which is more time consuming.” 

The research signals that audio might become a more sought-after data source in the race to train robots with AI. Researchers are teaching robots faster than ever before using imitation learning, showing them hundreds of examples of tasks being done instead of hand-coding each one. If audio could be collected at scale using devices like the one in the study, it could give robots an entirely new “sense,” helping them adapt more quickly to environments where visibility is limited or not useful.

“It’s safe to say that audio is the most understudied modality for sensing [in robots],” says Dmitry Berenson, associate professor of robotics at the University of Michigan, who was not involved in the study. That’s because the bulk of research on training robots to manipulate objects has been for industrial pick-and-place tasks, like sorting objects into bins. Those tasks don’t benefit much from sound, instead relying on tactile or visual sensors. But as robots broaden into tasks in homes, kitchens, and other environments, audio will become increasingly useful, Berenson says.

Consider a robot trying to find which bag or pocket contains a set of keys, all with limited visibility. “Maybe even before you touch the keys, you hear them kind of jangling,” Berenson says. “That’s a cue that the keys are in that pocket instead of others.”

Still, audio has limits. The team points out sound won’t be as useful with so-called soft or flexible objects like clothes, which don’t create as much usable audio. The robots also struggled with filtering out the audio of their own motor noises during tasks, since that noise was not present in the training data produced by humans. To fix it, the researchers needed to add robot sounds—whirs, hums, and actuator noises—into the training sets so the robots could learn to tune them out. 
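
To make that fix concrete, here is a minimal sketch of one common way to do this kind of augmentation: overlaying a recorded noise clip on a clean demonstration clip at a chosen signal-to-noise ratio. The article does not describe how the team actually mixed the sounds, so the additive mixing, the SNR parameter, and every name below (`mix_robot_noise`, `rms`, the synthetic `demo` and `motor` clips) are illustrative assumptions, not the researchers' method.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_robot_noise(demo_audio, robot_noise, snr_db):
    """Overlay a noise clip on a demonstration clip at a target SNR in dB.

    Hypothetical helper: the noise clip is looped or truncated to match
    the length of the demonstration clip, then scaled and added.
    """
    # Tile the noise so it covers the whole demonstration clip.
    repeats = len(demo_audio) // len(robot_noise) + 1
    noise = (robot_noise * repeats)[:len(demo_audio)]

    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(demo_audio)  # silent noise clip: nothing to add

    # Scale the noise so rms(demo) / rms(scaled noise) == 10^(snr_db / 20).
    target_noise_rms = rms(demo_audio) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [d + gain * n for d, n in zip(demo_audio, noise)]


# Synthetic stand-ins for real recordings (assumed, not data from the study).
rate = 16000
demo = [0.5 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
rng = random.Random(0)
motor = [rng.uniform(-1.0, 1.0) for _ in range(4000)]

augmented = mix_robot_noise(demo, motor, snr_db=10)
```

In a training pipeline, a step like this would typically be applied with randomly chosen noise clips and noise levels, so that a model sees the same demonstration under many different noise conditions.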

The next step, Liu says, is to see how much better the models can get with more data, which could mean adding more microphones, collecting spatial audio, and incorporating microphones into other types of data-collection devices. 
