Deep reinforcement learning works a lot like a child learning a skill: practice makes perfect. For an autonomous agent like a robot, though, its environment has to be reset to its original state between attempts—a chore that can take hours as humans scurry around replacing objects, for example.

A new arXiv paper by researchers with Google Brain, the University of Cambridge, the Max Planck Institute for Intelligent Systems, and UC Berkeley details a method that can teach an agent to reset the environment for the next attempt, as well as stop it from performing actions that would be irreversible. 

Their advance was to give agents a “forward” and “reset” policy that work together. While the forward policy is tasked with learning a skill, hitting reset forces an agent to learn how to “leave no trace,” effectively rewinding an action. Actions that the robot thinks would be irreversible are aborted as soon as possible.

The researchers write that they sought to give their agents “intuition” to classify anything that is reversible as safe, since it’s possible to go back to the original state. Through trial and error, the agent discovers that more and more actions are reversible, allowing it to explore safely.

Deep reinforcement learning is often done in simulation, and especially when real-world environments will be less forgiving of errors, like an autonomous car driving over a cliff. Even for safer situations, waiting for manual resets can become a bottleneck for data collection. For this reason, the team’s work was confined to virtual environments. Eventually, however, real-world testing has to be done, and this research could make it faster and safer.

As Jack Clark points out in his Import AI newsletter, this paper echoes the work outlined in another paper (PDF) from Facebook AI Research last month, in which a single agent has two separate modes, nicknamed Alice and Bob, one of which tries to reverse the task the other has attempted to complete. This type of work to make AI able to plan ahead could save it (and us) from disastrous mistakes in the future.