Facebook is training robot assistants to hear as well as see

The company’s AI lab is pushing the boundaries of its virtual simulation platform to train AI agents to carry out tasks like “Get my ringing phone.”

Karen Haoarchive page

August 21, 2020

Facebook AI

In June 2019, Facebook’s AI lab, FAIR, released AI Habitat, a new simulation platform for training AI agents. It allowed agents to explore various realistic virtual environments, like a furnished apartment or cubicle-filled office. The AI could then be ported into a robot, which would gain the smarts to navigate through the real world without crashing.

In the year since, FAIR has rapidly pushed the boundaries of its work on “embodied AI.” In a blog post today, the lab has announced three additional milestones reached: two new algorithms that allow an agent to quickly create and remember a map of the spaces it navigates, and the addition of sound on the platform to train the agents to hear.

The algorithms build on FAIR’s work in January of this year, when an agent was trained in Habitat to navigate unfamiliar environments without a map. Using just a depth-sensing camera, GPS, and compass data, it learned to enter a space much as a human would, and find the shortest possible path to its destination without wrong turns, backtracking, or exploration.

The first of these new algorithms can now build a map of the space at the same time, allowing it to remember the environment and navigate through it faster if it returns. The second improves the agent’s ability to map the space without needing to visit every part of it. Having been trained on enough virtual environments, it is able to anticipate certain features in a new one; it can know, for example, that there is likely to be empty floor space behind a kitchen island without navigating to the other side to look. Once again, this ultimately allows the agent to move through an environment faster.

Finally, the lab also created SoundSpaces, a sound-rendering tool that allows researchers to add highly realistic acoustics to any given Habitat environment. It could render the sounds produced by hitting different pieces of furniture, or the sounds of heels versus sneakers on a floor. The addition gives Habitat the ability to train agents on tasks that require both visual and auditory sensing, like “Get my ringing phone” or “Open the door where the person is knocking.”

Of the three developments, the addition of sound training is most exciting, says Ani Kembhavi, a robotics researcher at the Allen Institute for Artificial Intelligence, who was not involved in the work. Similar research in the past has focused more on giving agents the ability to see or to respond to text commands. “Adding audio is an essential and exciting next step,” he says. “I see many different tasks where audio inputs would be very useful.” The combination of vision and sound in particular is “an underexplored research area,” says Pieter Abeel, the director of the Robot Learning Lab at University of California, Berkeley.

Each of these developments, FAIR’s researchers say, brings the lab incrementally closer to achieving intelligent robotic assistants. The goal is for such companions to be able to move about nimbly and perform sophisticated tasks like cooking.

But it will be a long time before we can let robot assistants loose in the kitchen. One of the many hurdles FAIR will need to overcome: bringing all the virtual training to bear in the physical world, a process known as “sim2real” transfer. When the researchers initially tested their virtually trained algorithms in physical robots, the process didn’t go so well.

Moving forward, the FAIR researchers hope to start adding interaction capabilities into Habitat as well. “Let’s say I’m an agent,” says Kristen Grauman, a research scientist at FAIR and a computer science professor at the University of Texas, Austin, who led some of the work. “I walk in and I see these objects. What can I do with them? Where would I go if I’m supposed to make a soufflé? What tools would I pick up? These kinds of interactions and even manipulation-based changes to the environment would bring this kind of work to another level. That’s something we’re actively pursuing.”

Deep Dive

Artificial intelligence

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.