AI learned to use tools after nearly 500 million games of hide and seek

OpenAI’s agents evolved to exhibit complex behaviors, suggesting a promising approach for developing more sophisticated artificial intelligence.

Karen Haoarchive page

September 17, 2019

OpenAi

In the early days of life on Earth, biological organisms were exceedingly simple. They were microscopic unicellular creatures with little to no ability to coordinate. Yet billions of years of evolution through competition and natural selection led to the complex life forms we have today—as well as complex human intelligence.

Researchers at OpenAI, the San Francisco–based for-profit AI research lab, are now testing a hypothesis: if you could mimic that kind of competition in a virtual world, would it also give rise to much more sophisticated artificial intelligence?

The experiment builds on two existing ideas in the field: multi-agent learning, the idea of placing multiple algorithms in competition or coordination to provoke emergent behaviors, and reinforcement learning, the specific machine-learning technique that learns to achieve a goal through trial and error. (DeepMind popularized the latter with its breakthrough program AlphaGo, which beat the best human player in the ancient Chinese board game Go.)

In a new paper released today, OpenAI has now revealed its initial results. Through playing a simple game of hide and seek hundreds of millions of times, two opposing teams of AI agents developed complex hiding and seeking strategies that involved tool use and collaboration. The research also offers insight into OpenAI’s dominant research strategy: to dramatically scale existing AI techniques to see what properties emerge.

The hiders learn to build forts. — After roughly 25 million rounds of hide-and-seek, the hiders learned to build forts to win the game.
OpenAI

Six phases of strategy

To create the game, the researchers designed a virtual environment that consisted of an enclosed space with various objects like blocks, ramps, and mobile and immobile barricades. The agents themselves were controlled by reinforcement-learning algorithms. For each game, the agents were split into two teams: hiders were rewarded or penalized for avoiding or failing to avoid the seekers, respectively; seekers were also rewarded or penalized for finding or failing to find the hiders. As in a classic game of hide and seek, hiders were also given a few seconds’ head start. The researchers gave the agents no other instructions.

Over the course of hundreds of millions of rounds of game play, the agents developed several strategies and counter-strategies. During the first phase, the hiders and seekers developed very crude avoiding and chasing strategies by moving around the space without manipulating any of the objects.

In phase three, the seekers learned to climb over the fort's walls using a ramp.
OpenAI

But around the 25-million-game mark, play became more sophisticated. The hiders learned to move and lock the boxes and barricades in the environment to build forts around themselves so the seekers would never see them. Not only that, but they also developed coordination strategies, such as passing each other objects, in order to speed up the fort-building process.

After another 75 million rounds, however, the seekers discovered a counter-strategy, moving the process into phase three. They learned to move a ramp next to the hiders’ fort and use it to climb over the walls. By phase four, 10 million rounds later, the hiders had learned to lock the ramps in place before building their fort. (Neither team could unlock an object that was locked by an opposing team.)

At first, the researchers at OpenAI believed that this was the last phase of game play, but finally, at the 380-million-game mark, two more strategies emerged. The seekers once again developed a strategy to break into the hiders’ fort by using a locked ramp to climb onto an unlocked box, then “surf” their way on top of the box to the fort and over its walls. In the final phase, the hiders once again learned to lock all the ramps and boxes in place before building their fort.

Finally, after 380 million rounds of training, the seekers learned to "box surf" to climb over the fort's walls.
OpenAI

Emergent intelligence

The researchers believe that these initial results demonstrate a promising new method for evolving more complex AI. “We didn’t tell the hiders or the seekers to run near a box or interact with it,” says Bowen Baker, one of the authors of the paper. “But through multiagent competition, they created new tasks for each other such that the other team had to adapt.”

This study reflects OpenAI’s distinctive approach to AI research. Though the lab has also invested in developing novel techniques relative to other labs, it has primarily made a name for itself by dramatically scaling existing ones. GPT-2, the lab’s infamous language model, for example, heavily borrowed algorithmic design from earlier language models, including Google’s BERT; OpenAI’s primary innovation was a feat of engineering and expansive computational resources.

In a way, this study reaffirms the value of testing the limits of existing technologies at scale. The team also plans to continue with this strategy. The researchers say that the first round of experiments didn't come close to reaching the limits of the computational resources they could throw at the problem.

“We want people to imagine what would happen if you induced this kind of competition in a much more complex environment,” Baker says. “The behaviors they learn might actually be able to eventually solve some problems that we maybe don’t know how to solve already.”

Correction: The original headlined misstated the number of games the agents played.

Deep Dive

Artificial intelligence

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.