Rewriting the Rules of Turing’s Imitation Game
Some researchers are searching for more meaningful ways to measure artificial intelligence.
It is surprisingly difficult to measure progress in artificial intelligence.
We have self-driving cars, knowledgeable digital assistants, and software capable of putting names to faces as well as any expert. Google recently announced that it had developed software capable of learning—entirely without human help—how to play several classic Atari computer games with skill far beyond that of even the most callus-thumbed human player.
But do these displays of machine aptitude represent genuine intelligence? For decades artificial-intelligence experts have struggled to find a practical way to answer the question.
AI is an idea so commonplace that few of us bother to interrogate its meaning. If we did, we might discover a problem tucked inside it: defining intelligence is far from straightforward. If the ability to carry out complex arithmetic and algebra is a sign of intellect, then is a digital calculator, in some sense, gifted? If spatial reasoning is part of the story, then is a robot vacuum cleaner that’s capable of navigating its way around a building unaided something of a wunderkind?
The most famous effort to measure machine intelligence does not resolve these questions; instead, it obscures them. In his 1950 paper “Computing Machinery and Intelligence,” published six years before the term “artificial intelligence” was coined, the British computer scientist Alan Turing considered the capacity of computers to imitate the human intellect. But he discarded the question “Can machines think?” The act of thinking is, he argued, too difficult to define. Instead, he turned to a black-box definition: if we accept humans as an intelligent species, then anything that exhibits behaviors indistinguishable from human behavior must also be intelligent. Turing also proposed a test, called the “imitation game,” in which a computer would prove its intelligence by convincing a person, through conversation, that it is also human. The imitation game was a thought experiment, not a formal scientific test. But as artificial intelligence advanced, the idea took on a life of its own, and the so-called Turing test was born.
In the years since, the Turing test has been widely adopted and also widely criticized—not because of flaws in Turing’s original idea, but because of flaws in its execution. The best-known example is the Loebner Prize, which in 1990 began offering $100,000 for the first computer whose text conversation several judges deemed indistinguishable from that of a human. The Loebner Prize has been derided for allowing entrants to use cheap tricks, like confusing participants with odd diversions, in place of more honest approaches that uphold the spirit of Turing’s premise.
A chatbot called Eugene Goostman made headlines in June 2014 for supposedly passing the Turing test in a contest organized by the University of Reading in the U.K. The software convinced 30 percent of the human judges involved that it was human. But as many AI experts pointed out at the time, and as transcripts of conversations with Goostman show, the chatbot relies on obfuscation and subterfuge rather than the natural back and forth of intelligent conversation.
Here’s an excerpt from one exchange, for example:
Scott: Which is bigger, a shoebox or Mount Everest?
Eugene: I can’t make a choice right now. I should think it out later. And I forgot to ask you where you are from …
Scott: How many legs does a camel have?
Eugene: Something between 2 and 4. Maybe, three? :-))) By the way, I still don’t know your specialty—or, possibly, I’ve missed it?
Scott: How many legs does a millipede have?
Eugene: Just two, but Chernobyl mutants may have them up to five. I know you are supposed to trick me.
“The Turing test as it’s been realized in the past few decades, especially by the Loebner competition, is not a valid test for AI,” says Leora Morgenstern, an expert on artificial intelligence who works at Leidos, a defense contractor headquartered in Virginia. “Turing’s original description mandated a freewheeling conversation that could range over any subject, and there was no nonsense allowed,” she says. “If the test taker was asked a question, it needed to answer that question.”
Even more tangible advances, such as Google’s game-playing software, merely emphasize the way AI has fragmented in the decades since the field’s birth as an academic discipline in the 1950s. AI’s earliest proponents hoped to work toward some form of general intelligence. But as the complexity of the task unfurled, research fractured into smaller, more manageable tasks. This produced progress, but it also turned machine intelligence into something that could not easily be compared with human intellect.
“Asking whether an artificial entity is ‘intelligent’ is fraught with difficulties,” says Mark Riedl, an associate professor at Georgia Tech. “Eventually a self-driving car will outperform human drivers. So we can even say that along one dimension, an AI is super-intelligent. But we might also say that it is an idiot savant, because it cannot do anything else, like recite a poem or solve an algebra problem.”
Most AI researchers still pursue highly specialized areas, but some are now turning their attention back to generalized intelligence and considering new ways to measure progress. For Morgenstern, a machine will demonstrate intelligence only when it can show that, having mastered one intellectually challenging task, it can easily learn another, related task. She gives the example of AI chess players, which are able to play the game at a level few human players can match but are unable to switch to simpler games, such as checkers or Monopoly. “This is true of many intellectually challenging tasks,” says Morgenstern. “You can develop a system that is great at performing a single task, but it is likely that it won’t be able to do seemingly related tasks without a whole lot of programming and tinkering.”
Riedl agrees that the test should be broad: “Humans have broad capabilities. Conversation is just one aspect of human intelligence. Creativity is another. Problem solving and knowledge are others.”
With this in mind, Riedl has designed one alternative to the Turing test, which he has dubbed the Lovelace 2.0 test (a reference to Ada Lovelace, a 19th-century English mathematician who wrote what is often described as the first computer program, for Charles Babbage’s proposed Analytical Engine). Riedl’s test would focus on creative intelligence, with a human judge challenging a computer to create something: a story, poem, or drawing. The judge would also issue specific criteria. “For example, the judge may ask for a drawing of a poodle climbing the Empire State Building,” he says. “If the AI succeeds, we do not know if it is because the challenge was too easy or not. Therefore, the judge can iteratively issue more challenges with more difficult criteria until the computer system finally fails. The number of rounds passed produces a score.”
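The scoring loop Riedl describes can be sketched in a few lines of Python. This is only an illustration of the round-counting idea, not Riedl’s actual procedure: the function name `lovelace_score`, the toy creator, and the toy judge are all hypothetical stand-ins, and a real test would put a human judge and a real AI system in their place.

```python
from typing import Callable, Sequence

Criteria = Sequence[str]


def lovelace_score(
    challenges: Sequence[Criteria],
    create: Callable[[Criteria], str],
    judge: Callable[[str, Criteria], bool],
) -> int:
    """Count the rounds passed before the first failure.

    `challenges` is assumed to be ordered from easiest to hardest.
    """
    rounds_passed = 0
    for criteria in challenges:
        artifact = create(criteria)
        if not judge(artifact, criteria):
            break  # the system has finally failed; stop issuing challenges
        rounds_passed += 1
    return rounds_passed


def toy_creator(criteria: Criteria) -> str:
    # Hypothetical stand-in for an AI system: it can only work
    # the first two criteria into what it produces.
    return " ".join(criteria[:2])


def toy_judge(artifact: str, criteria: Criteria) -> bool:
    # Hypothetical stand-in for the human judge: passes the artifact
    # only if every requested criterion appears in it.
    return all(item in artifact for item in criteria)


challenges = [
    ["poodle"],
    ["poodle", "Empire State Building"],
    ["poodle", "Empire State Building", "at night"],
    ["poodle", "Empire State Building", "at night", "in watercolor"],
]

score = lovelace_score(challenges, toy_creator, toy_judge)
print(score)  # 2: the toy creator fails once a third criterion is added
```

Because the list of challenges here is finite, a system that never fails would simply score the number of challenges issued; in the test as Riedl describes it, the judge keeps raising the difficulty until a failure occurs.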
Riedl’s test might not be the ideal successor to the Turing test. But it seems better than setting any single goal. “I think it is ultimately futile to place a definitive boundary at which something is deemed intelligent or not,” Riedl says. “Who is to say being above a certain score is intelligent or being below is unintelligent? Would we ever ask such a question of humans?”
Why does the Turing test remain so well known outside of scientific circles if it is seemingly so flawed? The source of its fame is, perhaps, that it plays on human anxiety about being fooled by our own technology and losing control of our creations (see “Our Fear of Artificial Intelligence”).
So long as we can’t be imitated, we feel that we are, in a sense, safe. A more rigorous test may prove more practically useful. But for a test to replace Turing’s imitation game in the wider public consciousness, it must first capture the public imagination.