Facebook is training robot assistants to hear as well as see

The company’s AI lab is pushing the boundaries of its virtual simulation platform to train AI agents to carry out tasks like “Get my ringing phone.”
August 21, 2020

In June 2019, Facebook’s AI lab, FAIR, released AI Habitat, a new simulation platform for training AI agents. It allowed agents to explore various realistic virtual environments, like a furnished apartment or cubicle-filled office. The AI could then be ported into a robot, which would gain the smarts to navigate through the real world without crashing.

In the year since, FAIR has rapidly pushed the boundaries of its work on “embodied AI.” In a blog post today, the lab announced three more milestones: two new algorithms that allow an agent to quickly create and remember a map of the spaces it navigates, and the addition of sound to the platform so that agents can be trained to hear.

The algorithms build on FAIR’s work in January of this year, when an agent was trained in Habitat to navigate unfamiliar environments without a map. Using just a depth-sensing camera, GPS, and compass data, it learned to enter a space much as a human would, and find the shortest possible path to its destination without wrong turns, backtracking, or exploration.

The first of the new algorithms lets the agent build a map of the space as it navigates, so that it can remember the environment and move through it faster if it returns. The second improves the agent’s ability to map the space without needing to visit every part of it. Having been trained on enough virtual environments, the agent is able to anticipate certain features in a new one; it can know, for example, that there is likely to be empty floor space behind a kitchen island without navigating to the other side to look. Once again, this ultimately allows the agent to move through an environment faster.
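
To make the first idea concrete, here is a toy sketch in Python of why a remembered map speeds up a return visit. It is not FAIR’s algorithm and does not use Habitat: the grid world, the `observe`, `plan`, and `navigate` helpers, and the rule that the agent sees only the cells next to it are all assumptions made for illustration. The agent treats unexplored cells as passable, walks into a dead end on its first trip, and avoids it on the second because the walls are already in its memory.

```python
from collections import deque

UNKNOWN, FREE, WALL = "?", ".", "#"

def observe(world, memory, pos):
    """Copy the agent's cell and its eight neighbours from the world into memory."""
    r, c = pos
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < len(world) and 0 <= cc < len(world[0]):
                memory[rr][cc] = world[rr][cc]

def plan(memory, start, goal):
    """Breadth-first search over the remembered map.

    Unknown cells are optimistically treated as passable.
    Returns the list of cells from start to goal, or None.
    """
    queue = deque([start])
    parent = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            rr, cc = nxt
            if (0 <= rr < len(memory) and 0 <= cc < len(memory[0])
                    and memory[rr][cc] != WALL and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return None

def navigate(world, memory, start, goal):
    """Walk to the goal one cell at a time, replanning after every observation.

    Returns the number of steps taken. The memory grid is updated in place.
    """
    pos, steps = start, 0
    observe(world, memory, pos)
    while pos != goal:
        path = plan(memory, pos, goal)
        if path is None:
            raise ValueError("goal is unreachable")
        pos = path[1]  # neighbours of pos are already observed, so this cell is free
        steps += 1
        observe(world, memory, pos)
    return steps

# A hypothetical room with a dead-end alcove between the start and the goal.
world = [
    ".....",
    ".###.",
    "...#.",
    ".###.",
    ".....",
]
start, goal = (2, 0), (2, 4)
memory = [[UNKNOWN] * len(world[0]) for _ in world]

first = navigate(world, memory, start, goal)   # explores the alcove, then backtracks
second = navigate(world, memory, start, goal)  # reuses the map built on the first trip
print(first, second)
```

In this toy room the first trip takes 12 steps and the return trip takes 8, the length of the shortest route around the alcove. The second algorithm described above, which anticipates unseen parts of a room, would correspond to filling in some of the unknown cells with learned guesses before planning; that step is not modelled here.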

Finally, the lab also created SoundSpaces, a sound-rendering tool that allows researchers to add highly realistic acoustics to any given Habitat environment. It could render the sounds produced by hitting different pieces of furniture, or the sounds of heels versus sneakers on a floor. The addition gives Habitat the ability to train agents on tasks that require both visual and auditory sensing, like “Get my ringing phone” or “Open the door where the person is knocking.”

Of the three developments, the addition of sound training is the most exciting, says Ani Kembhavi, a robotics researcher at the Allen Institute for Artificial Intelligence, who was not involved in the work. Similar research in the past has focused more on giving agents the ability to see or to respond to text commands. “Adding audio is an essential and exciting next step,” he says. “I see many different tasks where audio inputs would be very useful.” The combination of vision and sound in particular is “an underexplored research area,” says Pieter Abbeel, the director of the Robot Learning Lab at the University of California, Berkeley.

Each of these developments, FAIR’s researchers say, brings the lab incrementally closer to achieving intelligent robotic assistants. The goal is for such companions to be able to move about nimbly and perform sophisticated tasks like cooking.

But it will be a long time before we can let robot assistants loose in the kitchen. One of the many hurdles FAIR will need to overcome: bringing all the virtual training to bear in the physical world, a process known as “sim2real” transfer. When the researchers initially tested their virtually trained algorithms in physical robots, the process didn’t go so well.

Moving forward, the FAIR researchers hope to start adding interaction capabilities to Habitat as well. “Let’s say I’m an agent,” says Kristen Grauman, a research scientist at FAIR and a computer science professor at the University of Texas at Austin, who led some of the work. “I walk in and I see these objects. What can I do with them? Where would I go if I’m supposed to make a soufflé? What tools would I pick up? These kinds of interactions and even manipulation-based changes to the environment would bring this kind of work to another level. That’s something we’re actively pursuing.”
