IBM’s debating AI just got a lot closer to being a useful tool

A technique called argument mining lets machines comb through huge data sets to help us make decisions. It could supercharge voice assistants.
IBM Debater system at Cambridge (image: IBM Research)

Computers have guided us to the moon and back but can’t help us with the biggest decisions we face today. Should Donald Trump be impeached and removed from office? Should Britain leave the EU? Should Australia stop exporting fossil fuels? Questions like these do not have easy yes-or-no answers, however tempting it is to think otherwise.

We make decisions by weighing pros and cons. Artificial intelligence has the potential to help us with that by sifting through ever-increasing mounds of data. But to be truly useful, it needs to reason more like a human. “We make use of persuasive language and all sorts of background knowledge that is very difficult to model in AI,” says Jacky Visser of the Center for Argument Technology at the University of Dundee, UK. “This has been one of the holy grails since people started thinking about AI.”

A core technique used to help machines reason, known as argument mining, involves building software to analyze written documents and extract key sentences that provide evidence for or against a given claim. These can then be assembled into an argument. As well as helping us make better decisions, such tools could be used to catch fake news—undermining dodgy claims and backing up factual ones—or to filter online search results, returning relevant statements rather than whole documents.

Other groups’ work on argument mining has focused on specific types of texts, such as legal documents or student essays, which tend to contain a lot of structured argument to start with. That’s useful if you want a summary of all the evidence across lots of different documents in a legal case, for example. But the ultimate goal is to build a system that can trawl through as many sources of information as possible and build an argument using every bit of evidence it can find.

IBM has just taken a big step in that direction. The company’s Project Debater team has spent several years developing an AI that can build arguments. Last year IBM demonstrated its work-in-progress technology in a live debate against a world-champion human debater, the equivalent of Watson’s Jeopardy! showdown. Such stunts are fun, and this one provided a proof of concept. Now IBM is turning its toy into a genuinely useful tool.

The version of Project Debater used in the live debates included the seeds of the latest system, such as the capability to search hundreds of millions of news articles. But in the months since, the team has extensively tweaked the neural networks it uses, improving the quality of the evidence the system can unearth. One important addition is BERT, a neural network Google built for natural-language processing, which can answer queries. The work will be presented at the Association for the Advancement of Artificial Intelligence conference in New York next month.

To train their AI, lead researcher Noam Slonim and his colleagues at IBM Research in Haifa, Israel, drew on 400 million documents taken from the LexisNexis database of newspaper and journal articles. This gave them some 10 billion sentences, a natural-language corpus around 50 times larger than Wikipedia. They paired this vast evidence pool with claims about several hundred different topics, such as “Blood donation should be mandatory” or “We should abandon Valentine’s Day.”

They then asked crowd workers on the Figure Eight platform to label sentences according to whether or not they provided evidence for or against particular claims. The labeled data was fed to a supervised learning algorithm.
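
To make the labeling-then-training step concrete, here is a minimal sketch in Python. It is not IBM's method: the article describes a neural network trained on crowd-labeled sentences, whereas this toy swaps in a much simpler naive Bayes word-count model so that it runs with the standard library alone. The sentences in `TOY_LABELS` are invented for illustration, and the names `train`, `evidence_score`, and `rank` are hypothetical.

```python
import math
from collections import Counter

# Invented toy data standing in for crowd-worker labels:
# (sentence, True if it was judged to be evidence for or against the claim).
TOY_LABELS = [
    ("A study found that donors have lower risk of heart attack", True),
    ("Research published in a journal found that donation reduces risk", True),
    ("Students are the main blood donors at the local blood bank", False),
    ("The blood bank is open on weekdays", False),
]

def tokenize(text):
    """Lowercase and split on anything that is not a letter (deliberately crude)."""
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()

def train(labeled):
    """Count words per label. Assumes both labels occur at least once."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for sentence, label in labeled:
        counts[label].update(tokenize(sentence))
        docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab

def evidence_score(model, sentence):
    """Log-odds that a sentence is evidence (naive Bayes, add-one smoothing)."""
    counts, docs, vocab = model
    totals = {label: sum(counts[label].values()) for label in (True, False)}
    score = math.log(docs[True] / docs[False])
    for word in tokenize(sentence):
        if word not in vocab:
            continue  # words never seen in training carry no signal
        p_true = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_false = (counts[False][word] + 1) / (totals[False] + len(vocab))
        score += math.log(p_true / p_false)
    return score

def rank(model, candidates):
    """Order candidate sentences from most to least evidence-like."""
    return sorted(candidates, key=lambda s: evidence_score(model, s), reverse=True)
```

Calling `rank(train(TOY_LABELS), candidates)` returns the candidates ordered by score. A word-count model like this cannot separate sentences that share the same terms, which is exactly the hard case the article goes on to describe and one reason a richer model is needed.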

The resulting neural network can handle queries on a wide variety of topics, returning sentences that are more relevant than those identified by previous systems. It ranks the sentences it finds according to how good they are as evidence. For example, given the claim “Blood donation should be mandatory,” the software found the sentence “A study published in the American Journal of Epidemiology found that blood donors have 88 percent less risk of suffering from a heart attack and stroke.”

A big challenge is telling sentences that provide evidence from ones that don’t, even though they contain the same terms. Project Debater also found this sentence for the blood donation claim, for example, but was able to tell that it neither backed it up nor undermined it: “Statistics from the Nakasero Blood Bank show that students are the main blood donors, contributing about 80 percent of the blood collected worldwide.”

Exactly what it is about these sentences that the neural network picks up on to make its classification isn’t clear, says Slonim. Still, when tested, Project Debater achieved 95 percent accuracy for the top 50 sentences across 100 different topics, he says, adding: “These numbers are unheard of.” Other systems have coped with only a few dozen topics. It is also a big improvement over the live debate system Slonim showed off last year.
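
The article does not say exactly how that figure was computed. One common reading, assumed here, is precision at k: for each topic, take the k highest-ranked sentences, check what fraction human judges accept as valid evidence, and average over topics. The sketch below shows that arithmetic in Python; the function names and the judgments in `toy_judgments` are invented for illustration.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked sentences judged to be valid evidence.

    ranked_labels: booleans in rank order, best-scored sentence first.
    """
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(top) / len(top)

def mean_precision_at_k(per_topic, k):
    """Average precision@k across topics, weighting every topic equally."""
    scores = [precision_at_k(labels, k) for labels in per_topic.values()]
    return sum(scores) / len(scores)

# Invented judgments for two hypothetical topics, four ranked sentences each.
toy_judgments = {
    "topic A": [True, True, False, True],   # 3 of 4 accepted
    "topic B": [True, True, True, True],    # 4 of 4 accepted
}

print(mean_precision_at_k(toy_judgments, k=4))  # 0.875
```

Under this reading, the reported result would correspond to `k=50` over 100 topics, with roughly 95 of every 100 top-ranked sentences accepted by the judges.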

Other researchers I spoke to, including Visser and Oana Cocarascu, who studies argumentation software and natural-language processing at Imperial College London, were also impressed with the new system. For Cocarascu, it is the potential for real-world applications that is most exciting. A system trained on legal documents won’t cope with the many different types of evidence found online. Slonim’s team has shown that Project Debater can handle this broad range of sources. “That's what makes it great,” says Cocarascu.

The team is now releasing its training data for others to work with. Visser wants to build argument mining tools like Project Debater that can evaluate the quality of arguments, looking out for things like cognitive bias. He and his colleagues have used AI to assess the quality of argument in the 2016 US presidential debates, for example.

IBM is doing something similar itself. Via an add-on called Speech by Crowd, Project Debater can crowdsource arguments for and against a proposal. It then automatically assesses the quality of the submitted arguments using a neural network trained on a data set of around 30,000 arguments previously scored for quality by humans.

IBM plans to offer Project Debater as a platform to companies and governments. “We see the future of Project Debater as an AI cloud service,” says Christopher Sciacca, a company spokesperson. In one example application, IBM collected 3,500 opinions from citizens in Lugano, Switzerland, about whether the city should invest in autonomous vehicles and used the AI to extract and assess arguments for and against the proposal. The local government could use the results to help make a policy decision. 

But for Slonim, it is all about improving our interaction with AI at a personal level. Argument plays an important part in how people communicate: we list reasons for our choices, we ask for advice, we persuade and cajole. Talking to virtual assistants that could converse at that level would feel far more natural. “What we are doing touches on something fundamental to our lives,” he says. “We’re trying to tie language-understanding technologies together to help people make better decisions.”
