Artificial intelligence

Facebook is training robot assistants to hear as well as see

The company’s AI lab is pushing the boundaries of its virtual simulation platform to train AI agents to carry out tasks like “Get my ringing phone.”
August 21, 2020
A room in a virtual simulation with a piano and a fire alarm (Facebook AI)

In June 2019, Facebook’s AI lab, FAIR, released AI Habitat, a new simulation platform for training AI agents. It allowed agents to explore various realistic virtual environments, like a furnished apartment or cubicle-filled office. The AI could then be ported into a robot, which would gain the smarts to navigate through the real world without crashing.

In the year since, FAIR has rapidly pushed the boundaries of its work on “embodied AI.” In a blog post today, the lab announced three additional milestones: two new algorithms that allow an agent to quickly create and remember a map of the spaces it navigates, and the addition of sound to the platform so agents can be trained to hear.

The algorithms build on FAIR’s work from January of this year, when an agent was trained in Habitat to navigate unfamiliar environments without a map. Using just a depth-sensing camera, GPS, and compass data, it learned to enter a space much as a human would and find the shortest path to its destination without wrong turns, backtracking, or exploration.

The first of the new algorithms can now build a map of the space at the same time, allowing the agent to remember the environment and navigate through it faster if it returns. The second improves the agent’s ability to map a space without needing to visit every part of it. Having been trained on enough virtual environments, it can anticipate certain features in a new one: it can infer, for example, that there is likely to be empty floor space behind a kitchen island without navigating to the other side to look. Once again, this ultimately allows the agent to move through an environment faster.

Finally, the lab also created SoundSpaces, a sound-rendering tool that allows researchers to add highly realistic acoustics to any given Habitat environment. It could render the sounds produced by hitting different pieces of furniture, or the sounds of heels versus sneakers on a floor. The addition gives Habitat the ability to train agents on tasks that require both visual and auditory sensing, like “Get my ringing phone” or “Open the door where the person is knocking.”

Of the three developments, the addition of sound training is the most exciting, says Ani Kembhavi, a robotics researcher at the Allen Institute for Artificial Intelligence, who was not involved in the work. Similar research in the past has focused more on giving agents the ability to see or to respond to text commands. “Adding audio is an essential and exciting next step,” he says. “I see many different tasks where audio inputs would be very useful.” The combination of vision and sound in particular is “an underexplored research area,” says Pieter Abbeel, the director of the Robot Learning Lab at the University of California, Berkeley.

Each of these developments, FAIR’s researchers say, brings the lab incrementally closer to achieving intelligent robotic assistants. The goal is for such companions to be able to move about nimbly and perform sophisticated tasks like cooking.

But it will be a long time before we can let robot assistants loose in the kitchen. One of the many hurdles FAIR will need to overcome: bringing all the virtual training to bear in the physical world, a process known as “sim2real” transfer. When the researchers initially tested their virtually trained algorithms in physical robots, the process didn’t go so well.

Moving forward, the FAIR researchers hope to start adding interaction capabilities into Habitat as well. “Let’s say I’m an agent,”  says Kristen Grauman, a research scientist at FAIR and a computer science professor at the University of Texas, Austin, who led some of the work. “I walk in and I see these objects. What can I do with them? Where would I go if I’m supposed to make a soufflé? What tools would I pick up? These kinds of interactions and even manipulation-based changes to the environment would bring this kind of work to another level. That’s something we’re actively pursuing.”
