Tougher Turing Test Exposes Chatbots’ Stupidity

We have a long way to go if we want virtual assistants to understand us.

User: Siri, call me an ambulance.

Siri: Okay, from now on I’ll call you “an ambulance.”

Apple fixed this error shortly after its virtual assistant was first released in 2011. But a new contest shows that computers still lack the common sense required to avoid such embarrassing mix-ups.

The results of the contest were presented at an academic conference in New York this week, and they provide some measure of how much work needs to be done to make computers truly intelligent.

Illustration by Max Bode

The Winograd Schema Challenge asks computers to make sense of sentences that are ambiguous but usually simple for humans to parse. Disambiguating Winograd Schema sentences requires some common-sense understanding. In the sentence “The city councilmen refused the demonstrators a permit because they feared violence,” it is logically unclear who the word “they” refers to, although humans understand because of the broader context.
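To make the structure of such a question concrete, here is a minimal sketch in Python of how one schema might be represented. The class and field names are hypothetical and are not taken from the contest's materials. The "advocated" variant is the well-known companion to the sentence quoted above: swapping that single word flips the correct referent of "they."

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class WinogradSchema:
    """One schema: a sentence template whose pronoun changes referent
    depending on a single 'special' word. (Hypothetical structure.)"""
    template: str                # sentence with a {word} slot
    pronoun: str                 # the ambiguous pronoun
    candidates: Tuple[str, str]  # the two possible referents
    answers: Dict[str, str]      # special word -> correct referent

    def sentence(self, word: str) -> str:
        # Fill the slot to produce one concrete test sentence.
        return self.template.format(word=word)

    def is_correct(self, word: str, guess: str) -> bool:
        # A guess is right only if it matches the referent for that word.
        return self.answers[word] == guess


councilmen = WinogradSchema(
    template=("The city councilmen refused the demonstrators a permit "
              "because they {word} violence."),
    pronoun="they",
    candidates=("the city councilmen", "the demonstrators"),
    answers={
        "feared": "the city councilmen",
        "advocated": "the demonstrators",
    },
)

print(councilmen.sentence("feared"))
print(councilmen.is_correct("feared", "the demonstrators"))
```

Nothing in the grammar distinguishes the two variants; only knowledge about councils, permits, and protests does, which is why the format is hard to game.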

The programs entered into the challenge were only a little better than random at choosing the correct meaning of sentences. The best two entrants were correct 48 percent of the time, compared with 45 percent if the answers were chosen at random. To be eligible to claim the grand prize of $25,000, entrants would need to achieve at least 90 percent accuracy. The joint best entries came from Quan Liu, a researcher at the University of Science and Technology of China, and Nicos Issak, a researcher from the Open University of Cyprus.
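The arithmetic behind those figures is simple, and a rough sketch shows how far the entrants were from the prize. The function names below are hypothetical, and the toy answer lists are illustrative only, not contest data. The 90 percent threshold and the 48 and 45 percent figures are the ones reported above.

```python
from typing import Sequence

PRIZE_THRESHOLD = 0.90  # minimum accuracy for the grand prize, per the article


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions answered correctly."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("need one prediction per question")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def eligible_for_prize(acc: float) -> bool:
    """True if an accuracy score meets the prize threshold."""
    return acc >= PRIZE_THRESHOLD


# Toy example (made-up answers): 3 of 4 correct.
toy_gold = ["A", "B", "A", "B"]
toy_predictions = ["A", "B", "B", "B"]
toy_score = accuracy(toy_predictions, toy_gold)

# Figures reported in the article.
best_entry = 0.48
chance = 0.45
margin_over_chance = best_entry - chance     # about 3 percentage points
gap_to_prize = PRIZE_THRESHOLD - best_entry  # about 42 percentage points

print(toy_score, eligible_for_prize(toy_score))
print(round(margin_over_chance, 2), round(gap_to_prize, 2))
```

By this measure the best systems cleared chance by about three percentage points and fell about 42 points short of the prize threshold.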

“It’s unsurprising that machines were barely better than chance,” says Gary Marcus, a research psychologist at New York University and an advisor to the contest. That’s because giving computers common-sense knowledge is notoriously difficult. Hand-coding knowledge is impossibly time-consuming, and it isn’t simple for computers to learn about the real world by performing statistical analysis of text. Most of the entrants in the Winograd Schema Challenge try to use some combination of hand-coded grammar understanding and a knowledge base of facts.

Marcus, who is also the cofounder of a new AI startup, Geometric Intelligence, says it’s notable that Google and Facebook did not take part in the event, even though researchers at these companies have suggested they are making major progress in natural language understanding. “It could’ve been that those guys waltzed into this room and got a hundred percent and said ‘hah!’” he says. “But that would’ve astounded me.”

The contest does not only serve as a measure of progress in AI. It also shows how hard it will be to build more intuitive and graceful chatbots, and to train computers to extract more information from written text.

Researchers at Google, Facebook, Amazon, and Microsoft are turning their attention to language. They are using the latest machine learning techniques, especially “deep learning” neural networks, to develop smarter, more intuitive chatbots and personal assistants (see “Teaching Machines to Understand Us”). Indeed, with chatbots and voice assistants becoming more common, and with dramatic progress in areas like image and speech recognition, you might think that machines were getting pretty good at understanding language.

One of the two first-place entries did, in fact, use a cutting-edge machine learning approach. Liu’s group, which included researchers from York University in Toronto and the National Research Council of Canada, used deep learning to train a computer to recognize the relationship between different events, such as “playing basketball” and “winning” or “getting injured,” from thousands of texts.

“I was delighted to see deep learning used,” says Leora Morgenstern, a senior scientist at Leidos Corporation, a technology consulting firm, and one of the organizers of the challenge.

Liu’s team claims that after fixing a problem with the way its system parsed the contest’s questions, it is almost 60 percent accurate. Morgenstern cautions, however, that even if these claims were confirmed, the accuracy would still be far worse than a human's.

Winograd Schema sentences were first highlighted as a way to gauge machine comprehension by Hector Levesque, an artificial-intelligence researcher at the University of Toronto. They are named after Terry Winograd, a pioneer in the field and a professor at Stanford University who built one of the first conversational computer programs.

The challenge was proposed in 2014 as an improvement on the Turing Test. Alan Turing, a forefather of computing and artificial intelligence who in the 1950s pondered whether machines might one day think as humans do, suggested a simple way of testing a machine’s intelligence. His idea was for a machine to try, over a text conversation, to fool a person into thinking they were talking to another human.

The problem with the Turing Test is that it’s often easy for a program to fool a person using simple tricks and evasions. But a program cannot parse Winograd Schema or other ambiguous sentences without some form of general knowledge.

The contest could have significant practical implications. “It’s going to come up when you start to support dialogues,” says Charlie Ortiz, a senior principal researcher at Nuance, a company that makes voice recognition and voice interface software, which sponsored the Winograd Schema Challenge. Ortiz says common-sense reasoning will be required for even simple conversations with computers. “In shopping, if I say, ‘I want to get a case for my guitar; it should be strong.’ So does ‘it’ refer to the case or the guitar?”

Marcus adds that common-sense reasoning will become more important as devices such as smart appliances or wearable gadgets become more common. “When you want to ask a query of your watch you don’t get to scroll through 50 choices,” he says. “When you start talking to your car or your watch, and you get rid of the typing modality and want to have a connected set of sentences—this conversational discourse—people just naturally refer back to things, and you need to solve these problems to make it work.”

