AI assistants say dumb things, and we’re about to find out why

A new test could prove that when it comes to language, today’s best AI systems are fundamentally limited.

Will Knight

March 14, 2018

Siri and Alexa are clearly far from perfect, but there is hope that steady progress in machine learning will turn them into articulate helpers before long. A new test, however, may help show that a fundamentally different approach is required for AI systems to actually master language.

Developed by researchers at the Allen Institute for AI (AI2), a nonprofit based in Seattle, the AI2 Reasoning Challenge (ARC) will pose elementary-school-level multiple-choice science questions. Each question will require some understanding of how the world works. The project is described in a related research paper.

Here’s one question: “Which item below is not made from a material grown in nature? (A) a cotton shirt (B) a wooden chair (C) a plastic spoon (D) a grass basket”

Such a question is easy for anyone who knows plastic is not something that grows. The answer taps into a common-sense picture of the world that even young children possess.

It is this common sense that the AI behind voice assistants, chatbots, and translation software lacks. And it’s one reason they are so easily confused.

Language systems that rely on machine learning can often provide convincing answers to questions if they have seen lots of similar examples before. A program trained on many thousands of IT support chats, for instance, might be able to pass itself off as a tech support helper in limited situations. But such a system would fail if asked something that required broader knowledge.

“We need to use our common sense to fill in the gaps around the language we see to form a coherent picture of what is being stated,” says Peter Clark, the lead researcher on the ARC project. “Machines do not have this common sense, and thus only see what is explicitly written, and miss the many implications and assumptions that underlie a piece of text.”

The new test is part of an initiative at AI2 to imbue AI systems with such an understanding of the world. And it is important because determining how well a language system understands what it is saying can be tricky.

For instance, in January researchers at Microsoft and another group at Alibaba developed question-and-answer programs that outperformed humans in a simple test called the Stanford Question Answering Dataset. These advances were accompanied by headlines proclaiming that AI programs could now read better than humans. But the programs could not answer more complex questions or draw on other sources of knowledge.

Tech companies will continue to tout the capabilities of AI systems in this way. Microsoft announced today that it has developed software capable of translating English news stories into Chinese, and vice versa, with results that independent volunteers deem equal to the work of professional translators. The company’s researchers used advanced deep-learning techniques to reach a new level of accuracy. While this is potentially very useful, the system would struggle if asked to translate free-ranging conversation or text from an unfamiliar domain, such as medical notes.

Gary Marcus, a professor at NYU who has argued for the importance of common sense in AI, is encouraged by the AI2 challenge. “I think this is a great antidote to the kind of superficial benchmarks that have become so common in the field of machine learning,” he says. “It should really force AI researchers to up their game.”
