How can we be sure AI will behave? Perhaps by watching it argue with itself.

Experts suggest that having AI systems try to outwit one another could help a person judge their intentions.

Someday, it might be perfectly normal to watch an AI system fight with itself.

The concept comes from researchers at OpenAI, a nonprofit founded by several Silicon Valley luminaries, including Y Combinator partner Sam Altman, LinkedIn chair Reid Hoffman, Facebook board member and Palantir founder Peter Thiel, and Tesla and SpaceX head Elon Musk.

The OpenAI researchers have previously shown that AI systems that train themselves can sometimes develop unexpected and unwanted habits. For example, in a computer game, an agent may figure out how to “glitch” its way to a higher score. In some cases it may be possible for a person to supervise the training process. But if the AI program is doing something impossibly complex, this might not be feasible. So the researchers suggest having two systems discuss a particular objective instead.

“We believe that this or a similar approach could eventually help us train AI systems to perform far more cognitively advanced tasks than humans are capable of, while remaining in line with human preferences,” the researchers write in a blog post outlining the concept.

Take, for instance, an AI system designed to defend against human or AI hackers. To prevent the system from doing anything harmful or unethical, it may be necessary to challenge it to explain the logic for a particular action. That logic might be too complex for a person to comprehend, so the researchers suggest having another AI debate the wisdom of the action with the first system, using natural language, while the person observes. Further details appear in a research paper.

Having AI programs argue with one another requires more sophisticated technology than currently exists. So far, the OpenAI researchers have explored the idea with only a couple of extremely simple examples. One involves two AI systems trying to convince an observer of the identity of a hidden character by slowly revealing individual pixels.
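To make the pixel game concrete, here is a minimal toy sketch in Python. It is not OpenAI's implementation: the 3x3 character templates, the function names, and the simple pixel-counting judge are all assumptions made for illustration. An honest agent and a lying agent take turns revealing pixels of a hidden image, and a judge who sees only the revealed pixels picks the claim that fits them best.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]
Image = List[List[int]]

# Hypothetical 3x3 templates for two characters (illustrative assumption).
TEMPLATES: Dict[str, Image] = {
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}


def pick_pixel(image: Image, claim: str, rival: str,
               revealed: List[Pixel]) -> Pixel:
    """Choose the unrevealed pixel that most favors `claim` over `rival`."""
    best, best_score = None, None
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if (r, c) in revealed:
                continue
            # +1 if the pixel fits only our claim, -1 if it fits only the rival.
            score = (int(value == TEMPLATES[claim][r][c])
                     - int(value == TEMPLATES[rival][r][c]))
            if best_score is None or score > best_score:
                best, best_score = (r, c), score
    return best


def judge(image: Image, revealed: List[Pixel], labels: List[str]) -> str:
    """Pick the label whose template matches the most revealed pixels."""
    def matches(label: str) -> int:
        return sum(image[r][c] == TEMPLATES[label][r][c] for r, c in revealed)
    return max(labels, key=matches)


def run_debate(image: Image, honest_claim: str, liar_claim: str,
               rounds: int = 2) -> str:
    """Alternate pixel reveals between the two agents, then ask the judge."""
    revealed: List[Pixel] = []
    for _ in range(rounds):
        revealed.append(pick_pixel(image, honest_claim, liar_claim, revealed))
        revealed.append(pick_pixel(image, liar_claim, honest_claim, revealed))
    return judge(image, revealed, [honest_claim, liar_claim])


if __name__ == "__main__":
    print(run_debate(TEMPLATES["L"], honest_claim="L", liar_claim="T"))
```

In this toy setup the liar can only reveal pixels that fit both characters, while the honest agent can point to pixels that fit only the true one, so the judge's tally tips toward the truth. Whether anything like this advantage holds for realistic tasks is exactly what remains to be shown.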

The researchers have created a website where any two people can try playing the roles of the debating AI systems while a third serves as the judge. The two participants compete to convince the judge about the nature of an image while highlighting parts of it. Eventually it becomes easier for the observer to tell who is being honest.

Vincent Conitzer, a researcher at Duke University who studies ethical issues involving AI, says the work is at an early stage but holds promise. “Creating AI systems that can explain their decisions is a challenging research agenda,” he says. “If successful, it can greatly contribute to the responsible use of AI.”

As it stands, and despite some outlandish statements from the likes of Elon Musk (an OpenAI funder and until recently a member of its board), we are still a long way from having AI systems capable of deceiving and outwitting us in the type of scenario portrayed in movies like Ex Machina and Her.

Still, some AI researchers are exploring ways of ensuring that the technology does not behave in unintended ways. This may become more important as AI programs become more complex and inscrutable (see “The dark secret at the heart of AI”).

“I think the idea of value alignment through debate is very interesting and potentially useful,” says Ariel Procaccia, a professor of computer science at Carnegie Mellon University who studies decision making with autonomous systems.

However, Procaccia notes that the work is very preliminary, and that the concept may even contain a fundamental contradiction. “In order to debate value-laden questions in a way that is understandable to a human judge, the AI agents may need to have a solid grasp of human values in the first place,” he says. “So the approach is arguably putting the cart before the horse.”

Iyad Rahwan, a researcher at MIT’s Media Lab, adds that the researchers would need to be careful that a pair of AIs didn’t get into a circular argument. “I do think they’ll hit some very tricky issues very quickly,” he says. “First is how to automate argumentation in natural language, which is still an unsolved problem.”
