Tougher Turing Test Exposes Chatbots’ Stupidity

We have a long way to go if we want virtual assistants to understand us.

Will Knight

July 14, 2016

User: Siri, call me an ambulance.

Siri: Okay, from now on I’ll call you “an ambulance.”

Apple fixed this error shortly after its virtual assistant was first released in 2011. But a new contest shows that computers still lack the common sense required to avoid such embarrassing mix-ups.

The results of the contest were presented at an academic conference in New York this week, and they provide some measure of how much work needs to be done to make computers truly intelligent.

The Winograd Schema Challenge asks computers to make sense of sentences that are ambiguous but usually simple for humans to parse. Disambiguating Winograd Schema sentences requires some common-sense understanding. In the sentence “The city councilmen refused the demonstrators a permit because they feared violence,” it is logically unclear who the word “they” refers to, although humans understand because of the broader context.

The programs entered into the challenge were a little better than random at choosing the correct meaning of sentences. The best two entrants were correct 48 percent of the time, compared to 45 percent if the answers are chosen at random. To be eligible to claim the grand prize of $25,000, entrants would need to achieve at least 90 percent accuracy. The joint best entries came from Quan Liu, a researcher at the University of Science and Technology of China, and Nicos Issak, a researcher from the Open University of Cyprus.
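For readers who want to see the scoring concretely, here is a minimal Python sketch of how a Winograd schema question and the contest's accuracy bar might be represented. It is an illustration under stated assumptions, not the organizers' actual code: the `WinogradSchema` class, the function names, and the toy `first_candidate_resolver` are all hypothetical, and the second sentence is the well-known "advocated violence" twin of the councilmen example, which this article does not quote. Only the 90 percent prize threshold comes from the contest as reported here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class WinogradSchema:
    """One question: a sentence, an ambiguous pronoun, and candidate referents."""
    sentence: str
    pronoun: str
    candidates: Tuple[str, ...]
    answer: str  # the referent a human reader would pick


# Reported in the article: entrants need at least 90 percent accuracy
# to be eligible for the grand prize.
PRIZE_THRESHOLD = 0.90


def accuracy(schemas: Sequence[WinogradSchema],
             resolver: Callable[[WinogradSchema], str]) -> float:
    """Fraction of schemas for which the resolver picks the right referent."""
    correct = sum(1 for s in schemas if resolver(s) == s.answer)
    return correct / len(schemas)


def chance_accuracy(schemas: Sequence[WinogradSchema]) -> float:
    """Expected accuracy of guessing uniformly among each question's candidates."""
    return sum(1 / len(s.candidates) for s in schemas) / len(schemas)


def eligible_for_prize(score: float) -> bool:
    """True if a score meets the 90 percent bar."""
    return score >= PRIZE_THRESHOLD


# Two toy questions (an assumption for illustration, not the contest's test set).
SCHEMAS = [
    WinogradSchema(
        sentence=("The city councilmen refused the demonstrators a permit "
                  "because they feared violence."),
        pronoun="they",
        candidates=("the city councilmen", "the demonstrators"),
        answer="the city councilmen",
    ),
    WinogradSchema(
        sentence=("The city councilmen refused the demonstrators a permit "
                  "because they advocated violence."),
        pronoun="they",
        candidates=("the city councilmen", "the demonstrators"),
        answer="the demonstrators",
    ),
]


def first_candidate_resolver(schema: WinogradSchema) -> str:
    """A deliberately naive baseline: always pick the first-mentioned candidate."""
    return schema.candidates[0]


score = accuracy(SCHEMAS, first_candidate_resolver)
baseline = chance_accuracy(SCHEMAS)
print(f"score={score:.2f} chance={baseline:.2f} eligible={eligible_for_prize(score)}")
```

Because the twin sentences differ by one word but flip the answer, a program that ignores meaning and always picks the same position can do no better than chance on the pair. The 45 percent chance figure reported for the real contest depends on its own mix of questions, which this sketch does not reproduce.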

“It’s unsurprising that machines were barely better than chance,” says Gary Marcus, a research psychologist at New York University and an advisor to the contest. That’s because giving computers common-sense knowledge is notoriously difficult. Hand-coding knowledge is impossibly time-consuming, and it isn’t simple for computers to learn about the real world by performing statistical analysis of text. Most of the entrants in the Winograd Schema Challenge try to use some combination of hand-coded grammar understanding and a knowledge base of facts.

Marcus, who is also the cofounder of a new AI startup, Geometric Intelligence, says it’s notable that Google and Facebook did not take part in the event, even though researchers at these companies have suggested they are making major progress in natural language understanding. “It could’ve been that those guys waltzed into this room and got a hundred percent and said ‘hah!’” he says. “But that would’ve astounded me.”

The contest does not only serve as a measure of progress in AI. It also shows how hard it will be to build more intuitive and graceful chatbots, and to train computers to extract more information from written text.

Researchers at Google, Facebook, Amazon, and Microsoft are turning their attention to language. They are using the latest machine learning techniques, especially “deep learning” neural networks, to develop smarter, more intuitive chatbots and personal assistants (see “Teaching Machines to Understand Us”). Indeed, with chatbots and voice assistants becoming more common, and with dramatic progress in areas like image and speech recognition, you might think that machines are getting pretty good at understanding language.

One of the two first-place entries did, in fact, use a cutting-edge machine learning approach. Liu’s group, which included researchers from York University in Toronto and the National Research Council of Canada, used deep learning to train a computer to recognize the relationship between different events, such as “playing basketball” and “winning” or “getting injured,” from thousands of texts.

“I was delighted to see deep learning used,” says Leora Morgenstern, a senior scientist at Leidos Corporation, a technology consulting firm, and one of the organizers of the challenge.

Liu’s team claims that after fixing a problem with the way its system parsed the contest’s questions, the system is almost 60 percent accurate. Morgenstern cautions, however, that even if these claims were confirmed, the accuracy would still be far worse than a human’s.

Winograd Schema sentences were first highlighted as a way to gauge machine comprehension by Hector Levesque, an artificial-intelligence researcher at the University of Toronto. They are named after Terry Winograd, a pioneer in the field and a professor at Stanford University who built one of the first conversational computer programs.

The challenge was proposed in 2014 as an improvement on the Turing Test. Alan Turing, a forefather of computing and artificial intelligence who in the 1950s pondered whether machines might one day think as humans do, suggested a simple way of testing a machine’s intelligence. His idea was for a machine to try to fool a person, over a text conversation, into believing that they were talking with another human.

The problem with the Turing Test is that it’s often easy for a program to fool a person using simple tricks and evasions. But a program cannot parse Winograd Schema or other ambiguous sentences without some form of general knowledge.

The contest could have significant practical implications. “It’s going to come up when you start to support dialogues,” says Charlie Ortiz, a senior principal researcher at Nuance, a company that makes voice recognition and voice interface software, which sponsored the Winograd Schema Challenge. Ortiz says common-sense reasoning will be required for even simple conversations with computers. “In shopping, if I say, ‘I want to get a case for my guitar; it should be strong.’ So does ‘it’ refer to the case or the guitar?”

Marcus adds that common-sense reasoning will become more important as devices such as smart appliances or wearable gadgets become more common. “When you want to ask a query of your watch you don’t get to scroll through 50 choices,” he says. “When you start talking to your car or your watch, and you get rid of the typing modality and want to have a connected set of sentences—this conversational discourse—people just naturally refer back to things, and you need to solve these problems to make it work.”
