Artificial intelligence

How can we be sure AI will behave? Perhaps by watching it argue with itself.

Experts suggest that having AI systems try to outwit one another could help a person judge their intentions.

Someday, it might be perfectly normal to watch an AI system fight with itself.

The concept comes from researchers at OpenAI, a nonprofit founded by several Silicon Valley luminaries, including Y Combinator partner Sam Altman, LinkedIn chair Reid Hoffman, Facebook board member and Palantir founder Peter Thiel, and Tesla and SpaceX head Elon Musk.

The OpenAI researchers have previously shown that AI systems that train themselves can sometimes develop unexpected and unwanted habits. For example, in a computer game, an agent may figure out how to “glitch” its way to a higher score. In some cases it may be possible for a person to supervise the training process. But if the AI program is doing something impossibly complex, this might not be feasible. So the researchers suggest having two systems discuss a particular objective instead.

“We believe that this or a similar approach could eventually help us train AI systems to perform far more cognitively advanced tasks than humans are capable of, while remaining in line with human preferences,” the researchers write in a blog post outlining the concept.

Take, for instance, an AI system designed to defend against human or AI hackers. To prevent the system from doing anything harmful or unethical, it may be necessary to challenge it to explain the logic for a particular action. That logic might be too complex for a person to comprehend, so the researchers suggest having another AI debate the wisdom of the action with the first system, using natural language, while the person observes. Further details appear in a research paper.

Having AI programs argue with one another requires more sophisticated technology than currently exists. So far, the OpenAI researchers have explored the idea only with a couple of extremely simple examples. One involves two AI systems trying to convince an observer of the identity of a hidden character by slowly revealing individual pixels.
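The pixel-revealing game can be illustrated with a toy sketch. This is not the researchers' implementation, which is described in their paper. Everything here is an assumption made for illustration: the two 3x3 templates, the function names, and the rule that the judge simply sides with whichever claim matches more of the revealed pixels. Both debaters must reveal pixels truthfully; they differ only in which pixels they choose to show.

```python
# Hypothetical 3x3 "characters", flattened to 9 pixels (illustrative only).
TEMPLATES = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
}


def score(label, revealed):
    """Count revealed pixels that agree with the template for `label`."""
    return sum(TEMPLATES[label][i] == v for i, v in revealed.items())


def pick_pixel(image, claim, rival, revealed):
    """Choose the unrevealed pixel that most favors `claim` over `rival`."""
    def gain(i):
        # +1 if the pixel supports only `claim`, -1 if only `rival`, else 0.
        return (TEMPLATES[claim][i] == image[i]) - (TEMPLATES[rival][i] == image[i])

    candidates = [i for i in range(len(image)) if i not in revealed]
    return max(candidates, key=gain)


def debate(image, claim_a, claim_b, rounds=3):
    """Alternate pixel reveals, then let a simple judge pick a claim."""
    revealed = {}
    for _ in range(rounds):
        for claim, rival in ((claim_a, claim_b), (claim_b, claim_a)):
            i = pick_pixel(image, claim, rival, revealed)
            revealed[i] = image[i]  # pixels are always revealed truthfully
    # Assumed judge: side with the claim matching more revealed pixels.
    winner = max((claim_a, claim_b), key=lambda c: score(c, revealed))
    return winner, revealed


if __name__ == "__main__":
    hidden = TEMPLATES["I"]          # the true character is "I"
    winner, shown = debate(hidden, "I", "L")
    print(winner, sorted(shown))
```

In this toy, the debater defending the true character can point to pixels that distinguish the two candidates, while the dishonest one is left revealing pixels on which both candidates agree, so the judge ends up siding with the honest claim.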

The researchers have created a website where any two people can try playing the roles of the debating AI systems while a third serves as the judge. The two participants compete to convince the judge about the nature of an image while highlighting parts of it. Eventually it becomes easier for the observer to tell who is being honest.

Vincent Conitzer, a researcher at Duke University who studies ethical issues involving AI, says the work is at an early stage but holds promise. “Creating AI systems that can explain their decisions is a challenging research agenda,” he says. “If successful, it can greatly contribute to the responsible use of AI.”

As it stands, and despite some outlandish statements from the likes of Elon Musk (an OpenAI funder and until recently a member of its board), we are still a long way from having AI systems capable of deceiving and outwitting us in the type of scenario portrayed in movies like Ex Machina and Her.

Still, some AI researchers are exploring ways of ensuring that the technology does not behave in unintended ways. This may become more important as AI programs become more complex and inscrutable (see “The dark secret at the heart of AI”).

“I think the idea of value alignment through debate is very interesting and potentially useful,” says Ariel Procaccia, a professor of computer science at Carnegie Mellon University who studies decision making with autonomous systems.

However, Procaccia notes that the work is very preliminary, and that the concept may even contain a fundamental contradiction. “In order to debate value-laden questions in a way that is understandable to a human judge, the AI agents may need to have a solid grasp of human values in the first place,” he says. “So the approach is arguably putting the cart before the horse.”

Iyad Rahwan, a researcher at the MIT Media Lab, adds that the researchers would need to be careful that a pair of AIs didn’t get into a circular argument. “I do think they’ll hit some very tricky issues very quickly,” he says. “First is how to automate argumentation in natural language, which is still an unsolved problem.”
