A bot that watched 70,000 hours of Minecraft could unlock AI’s next big thing

Online videos are a vast and untapped source of training data—and OpenAI says it has a new way to use it.

OpenAI has built the best Minecraft-playing bot yet by making it watch 70,000 hours of video of people playing the popular computer game. It showcases a powerful new technique that could be used to train machines to carry out a wide range of tasks by binging on sites like YouTube, a vast and untapped source of training data.

The Minecraft AI learned to perform complicated sequences of keyboard and mouse clicks to complete tasks in the game, such as chopping down trees and crafting tools. It’s the first bot that can craft so-called diamond tools, a task that typically takes good human players 20 minutes of high-speed clicking—or around 24,000 actions.

The result is a breakthrough for a technique known as imitation learning, in which neural networks are trained to perform tasks by watching humans do them. Imitation learning can be used to train AI to control robot arms, drive cars, or navigate web pages.  

There is a vast amount of video online showing people doing different tasks. By tapping into this resource, the researchers hope to do for imitation learning what GPT-3 did for large language models. “In the last few years we’ve seen the rise of this GPT-3 paradigm where we see amazing capabilities come from big models trained on enormous swathes of the internet,” says Bowen Baker at OpenAI, one of the team behind the new Minecraft bot. “A large part of that is because we’re modeling what humans do when they go online.”

The problem with existing approaches to imitation learning is that video demonstrations need to be labeled at each step: doing this action makes this happen, doing that action makes that happen, and so on. Annotating by hand in this way is a lot of work, and so such data sets tend to be small. Baker and his colleagues wanted to find a way to turn the millions of videos that are available online into a new data set.

The team’s approach, called Video Pre-Training (VPT), gets around the bottleneck in imitation learning by training another neural network to label videos automatically. The researchers first hired crowdworkers to play Minecraft, and recorded their keyboard and mouse clicks alongside the video from their screens. This gave them 2,000 hours of annotated Minecraft play, which they used to train a model to match actions to onscreen outcomes. Clicking a mouse button in a certain situation makes the character swing its ax, for example.  
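
The first stage amounts to ordinary supervised learning: given a short clip, predict the action taken at its center. One detail worth noting is that this labeling model is allowed to look at frames both before and after the action, which makes its job much easier than actually playing the game. Below is a minimal PyTorch-style sketch of such an inverse dynamics model; the architecture, the size of the discretized action space, and the training loop are illustrative assumptions, not OpenAI's actual design.

```python
import torch
import torch.nn as nn

N_ACTIONS = 128   # assumed size of a discretized keyboard/mouse action space
CLIP_LEN = 16     # assumed number of frames the labeling model sees at once

class InverseDynamicsModel(nn.Module):
    """Predicts the action taken at the middle of a short video clip.

    Unlike the agent itself, this model may look at frames before AND
    after the action, which makes labeling easier than playing.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 7, 7), stride=(1, 4, 4)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(64, N_ACTIONS)

    def forward(self, frames):                    # frames: (batch, 3, CLIP_LEN, H, W)
        return self.head(self.encoder(frames))    # logits over actions

# Supervised training on the 2,000 hours of contractor play, where the
# recorded keypresses and mouse clicks serve as ground-truth labels.
idm = InverseDynamicsModel()
opt = torch.optim.Adam(idm.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

clips = torch.randn(8, 3, CLIP_LEN, 64, 64)    # stand-in for real video clips
actions = torch.randint(0, N_ACTIONS, (8,))    # stand-in for recorded actions

loss = loss_fn(idm(clips), actions)
opt.zero_grad()
loss.backward()
opt.step()
```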

The next step was to use this model to generate action labels for 70,000 hours of unlabeled video taken from the internet and then train the Minecraft bot on this larger data set.
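
Under the same assumptions, the second stage looks like this: run the labeling model over unlabeled clips to get pseudo-labels, then train the agent itself with plain behavioral cloning, that is, supervised learning on those guessed actions. The crucial difference is that the agent must be causal, acting only on what it has seen so far. Again, every name and size here is a placeholder, not OpenAI's code.

```python
import torch
import torch.nn as nn

N_ACTIONS = 128   # must match the action space assumed for the labeling model

@torch.no_grad()
def pseudo_label(idm, clips):
    """Stamp the labeling model's best-guess action onto unlabeled clips."""
    return idm(clips).argmax(dim=-1)

class Policy(nn.Module):
    """Causal agent: picks an action from the current frame only.

    (A real agent would also carry recurrent state over past frames.)
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(N_ACTIONS),
        )

    def forward(self, frame):     # frame: (batch, 3, H, W)
        return self.net(frame)    # action logits

# Behavioral cloning: treat the pseudo-labels as ground truth and do
# ordinary supervised learning over the 70,000 hours of internet video.
policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

frames = torch.randn(8, 3, 64, 64)           # stand-in for video frames
labels = torch.randint(0, N_ACTIONS, (8,))   # stand-in for pseudo_label(idm, clips)

loss = loss_fn(policy(frames), labels)
opt.zero_grad()
loss.backward()
opt.step()
```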

“Video is a training resource with a lot of potential,” says Peter Stone, executive director of Sony AI America, who has previously worked on imitation learning. 

Imitation learning is an alternative to reinforcement learning, in which a neural network learns to perform a task from scratch via trial and error. This is the technique behind many of the biggest AI breakthroughs in the last few years. It has been used to train models that can beat humans at games, control a fusion reactor, and discover a faster way to do fundamental math.

The problem is that reinforcement learning works best for tasks that have a clear goal, where random actions can lead to accidental success. Reinforcement learning algorithms reward those accidental successes to make them more likely to happen again.
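
To make that concrete, the textbook version of "make rewarded actions more likely" is a policy-gradient update. The toy sketch below (a generic illustration, not OpenAI's algorithm) nudges up the probability of whatever action happened to be followed by a reward.

```python
import torch
import torch.nn as nn

# Toy REINFORCE step: a policy tries an action, and if that action
# happens to be followed by a reward, its probability is nudged upward.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, 4)                       # toy observation
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()
reward = 1.0 if action.item() == 1 else 0.0     # the "accidental success"

# Scale the action's log-probability by the reward: rewarded actions
# become more likely next time; unrewarded ones are left alone.
loss = -(dist.log_prob(action) * reward).sum()
opt.zero_grad()
loss.backward()
opt.step()
```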

But Minecraft is a game with no clear goal. Players are free to do what they like: wandering a computer-generated world, mining different materials, and combining them to make different objects. 

Minecraft’s open-endedness makes it a good environment for training AI. Baker was one of the researchers behind Hide & Seek, a project in which bots were let loose in a virtual playground where they used reinforcement learning to figure out how to cooperate and use tools to win simple games. But the bots soon outgrew their surroundings. “The agents kind of took over the universe; there was nothing else for them to do,” says Baker. “We wanted to expand it, and we thought Minecraft was a great domain to work in.”

They’re not alone. Minecraft is becoming an important testbed for new AI techniques. MineDojo, a Minecraft environment with dozens of predesigned challenges, won an award at this year’s NeurIPS, one of the biggest AI conferences. 

Using VPT, OpenAI’s bot was able to carry out tasks that would have been impossible using reinforcement learning alone, such as crafting planks and turning them into a table, which involves around 970 consecutive actions. Even so, the team found that the best results came from using imitation learning and reinforcement learning together. Taking a bot trained with VPT and fine-tuning it with reinforcement learning allowed it to carry out tasks involving more than 20,000 consecutive actions.  
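
A hedged sketch of that combination, under the same illustrative assumptions as above: load the behavior-cloned weights, then keep training the same network with a policy-gradient update on a sparse in-game reward. The checkpoint path, learning rate, and reward here are all placeholders.

```python
import torch
import torch.nn as nn

N_ACTIONS = 128   # same assumed action space as the earlier sketches

# A toy stand-in for the behavior-cloned policy network.
policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, N_ACTIONS))

# In the real pipeline this would load the VPT-pretrained weights, e.g.:
# policy.load_state_dict(torch.load("vpt_pretrained.pt"))  # placeholder path

# Small learning rate, so RL fine-tuning refines the imitation prior
# rather than erasing it.
opt = torch.optim.Adam(policy.parameters(), lr=3e-5)

frame = torch.randn(1, 3, 64, 64)               # toy observation
dist = torch.distributions.Categorical(logits=policy(frame))
action = dist.sample()
reward = 1.0   # placeholder sparse reward, e.g. for a crafting milestone

loss = -(dist.log_prob(action) * reward).sum()  # reinforce the taken action
opt.zero_grad()
loss.backward()
opt.step()
```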

The researchers claim that their approach could be used to train AI to carry out other tasks. To begin with, it could be used for bots that use a keyboard and mouse to navigate websites, book flights, or buy groceries online. But in theory it could be used to train robots to carry out physical, real-world tasks by copying first-person video of people doing those things. “It’s plausible,” says Stone.

Matthew Guzdial at the University of Alberta in Canada, who has used videos to teach AI the rules of games like Super Mario Bros., does not think it will happen any time soon, however. Actions in games like Minecraft and Super Mario Bros. are performed by pressing buttons. Actions in the physical world are far more complicated and harder for a machine to learn. “It unlocks a whole mess of new research problems,” says Guzdial.

“This work is another testament to the power of scaling up models and training on massive data sets to get good performance,” says Natasha Jaques, who works on multi-agent reinforcement learning at Google and the University of California, Berkeley. 

Large internet-sized data sets will certainly unlock new capabilities for AI, says Jaques: “We've seen that over and over again, and it’s a great approach.” But OpenAI places a lot of faith in the power of large data sets alone, she says: “Personally, I’m a little more skeptical that data can solve any problem.”

Still, Baker and his colleagues think that collecting more than a million hours of Minecraft videos will make their AI even better. It’s probably the best Minecraft-playing bot yet, says Baker: “But with more data and bigger models, I would expect it to feel like you’re watching a human playing the game, as opposed to a baby AI trying to mimic a human.”
