The Best AI Program Still Flunks an Eighth-Grade Science Test

A contest designed to push the limits of artificial intelligence suggests that truly intelligent machines are a long way off.

Will Knight

February 17, 2016

For all the remarkable progress being made in artificial intelligence, and warnings about the upheaval this might bring, the smartest computer would still struggle to make it through the eighth grade.

A contest organized by researchers at the Allen Institute for Artificial Intelligence (AI2) invited programmers to create a program capable of taking a modified version of a conventional eighth-grade science test. The results of the competition were announced Tuesday at the annual meeting of the Association for the Advancement of Artificial Intelligence (AAAI).

The winner, Chaim Linhart, a contestant based in Israel, combined several established machine-learning techniques with large databases of scientific information to correctly answer 59 percent of the questions. Like other participants, Linhart fed his computer system hundreds of thousands of questions paired with correct answers so that it could learn to come up with the right answer.

A score of almost 60 percent might disappoint most parents, but it is remarkable for a computer. The test used for the contest was, however, simplified slightly to make it practical for computers to attempt. Diagrams were removed, for example, and only questions with multiple-choice answers were used.

Oren Etzioni, an AI researcher who directs AI2, says the eighth-grade-test challenge was meant to push researchers to develop software that isn’t just superficially intelligent.

In recent years, programs designed to perform specific tasks, especially ones that involve some sort of visual or audio processing, have advanced quickly thanks to extremely efficient machine-learning algorithms, especially those based on large neural networks. An impressive recent example is a Google program designed to play the subtle and computationally complex board game Go (see “Google’s AI Masters the Game of Go a Decade Earlier Than Expected”).

This progress has inspired hopes, and fears, that truly intelligent machines may not be so far away. But Etzioni believes new techniques will be needed to achieve even basic competency at more complex tasks, something that the latest results seem to confirm. Likewise, the best-known means of gauging progress in artificial intelligence, the Turing Test, has proved all too easy to rig using simple tricks.

Thomas Dietterich, a professor at Oregon State University and president of AAAI, says the AI2 contest is a useful exercise. Dietterich also says that combining different AI techniques will probably be necessary to achieve a much higher test score in the future. “Intelligence is one word, but it refers to many things,” he says. “One of the things I like about this is that it stresses breadth in a way that other tests have not.”

Gary Marcus, a cognitive scientist at New York University and the founder of an AI startup called Geometric Intelligence, is working on an AI competition that will have several different components. “Intelligence is a multi-dimensional variable,” Marcus says. “There’s not going to be one test that’s a divine measure of intelligence.”

Marcus adds that the AI2 competition actually highlights the size of the challenge that remains for AI researchers. “You can do a lot better than just guessing with some pretty primitive techniques, but at the same time 60 percent is a long way from really understanding science,” he says. “The best thing that comes out of this is that people might realize how hard the questions really are.”

The contest was organized through Kaggle, a popular platform for coordinating contests in which data scientists battle to create the most efficient algorithm for a particular task, like predicting which products will sell the most based on previous sales, or the outcome of the March Madness tournament. Last year, AI2 released a huge data set of questions and answers allowing participants to design algorithms to take the test (see “AI Software Goes Up Against Fourth Graders on Science Test”).

During a talk at the AAAI conference, Etzioni suggested that setting a more difficult test, over a longer time frame, with a larger prize, could inspire more AI researchers to get involved. “I do think that a longer contest and ‘deeper’ AI will be required to get from 60 percent to 80 percent or more,” Etzioni says. “It’s our hypothesis that you can’t do this with cheap tricks—that you have to do something smarter.”
