AI Software Goes Up Against Fourth Graders on Science Tests

Making AI software take real school exams might accelerate progress toward machines with common sense.

Tom Simonite

September 9, 2015

During which season of the year would a rabbit’s fur be thickest? A computer program called Aristo can tell you because it read about bears growing thicker pelts during winter in a fourth-grade study guide, and it knows rabbits are mammals, too. It’s studying for New York State’s standard science exams.

Aristo is being developed by researchers at the Allen Institute for Artificial Intelligence in Seattle, who want to give machines a measure of common sense about the world. The institute’s CEO, Oren Etzioni, says the best way to benchmark the development of their digital offspring is to use tests designed for schoolchildren. He’s trying to convince other AI researchers to adopt standardized school tests as a way to measure progress in the field.

“We can put our understanding of progress in AI and in natural language on an objective footing,” says Etzioni. Being able to compare the merits of different approaches should make it easier to identify the promising ones and could accelerate progress, he says.

Early in October, the Allen Institute will launch a contest challenging researchers to make software to take on eighth-grade science questions. The competition is being hosted at the data science website Kaggle, where entrants will be able to access thousands of practice questions to train their software. A prize of $50,000 will be awarded to the creator of the software that can score best on questions it hasn’t seen before.

Right now, Aristo isn’t close to passing the fourth-grade science test, which requires a score of 65 percent. Aristo can take on only the multiple-choice questions, which make up about two-thirds of the test. It scores about 75 percent on those that don’t involve diagrams and 45 percent on those that do, says Etzioni. On eighth-grade multiple-choice questions that exclude diagrams, Aristo scores 63 percent.
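To see why those figures leave Aristo short of a passing grade, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions the article does not state: that every question carries equal weight, that Aristo earns no credit on the questions that are not multiple choice, and that "about two-thirds" is exactly two-thirds. The share of multiple-choice questions that involve diagrams is not given, so the hypothetical `diagram_fraction` parameter is used only to bracket the result between its two extremes.

```python
# Figures quoted in the article, expressed as fractions.
PASS_MARK = 0.65        # score needed to pass the fourth-grade test
MC_SHARE = 2 / 3        # "about two-thirds" of the test is multiple choice (assumed exact)
ACC_NO_DIAGRAM = 0.75   # accuracy on multiple-choice questions without diagrams
ACC_DIAGRAM = 0.45      # accuracy on multiple-choice questions with diagrams


def overall_score(diagram_fraction: float) -> float:
    """Estimate the whole-test score.

    diagram_fraction is a hypothetical value: the share of multiple-choice
    questions that involve diagrams. The article does not report it.
    Assumes equal question weights and zero credit outside multiple choice.
    """
    if not 0.0 <= diagram_fraction <= 1.0:
        raise ValueError("diagram_fraction must be between 0 and 1")
    mc_accuracy = (
        (1 - diagram_fraction) * ACC_NO_DIAGRAM
        + diagram_fraction * ACC_DIAGRAM
    )
    return MC_SHARE * mc_accuracy


# Bracket the estimate: no diagram questions at all vs. all diagram questions.
best_case = overall_score(0.0)
worst_case = overall_score(1.0)

print(f"Best case:  {best_case:.0%}")
print(f"Worst case: {worst_case:.0%}")
print(f"Passes even in the best case? {best_case >= PASS_MARK}")
```

Under these assumptions the estimate falls between roughly 30 and 50 percent of the whole test, whatever the real mix of diagram questions, which is consistent with Etzioni's statement that Aristo is not yet close to the 65 percent mark.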

You can see Aristo answer selected fourth-grade questions at the Allen Institute website. The software uses reasoning algorithms to answer questions using knowledge harvested from study guides and the Web.

Finding a way to put even a dash of common sense into software is a major challenge in AI, and one that could lead to computers helping us out in many new ways. “If we want to build systems that are robust and more natural for people to work with, they will need these abilities,” says Etzioni.

It’s a view shared by other leading researchers, including those at Facebook’s expanding AI lab, which wants to enable virtual assistants capable of basic conversation (see “Teaching Machines to Understand Us”). One reason existing virtual assistants such as Apple’s Siri or Microsoft’s Cortana are very limited is that they must get by without common sense. They use what you say to select from a portfolio of preprogrammed rules.

Ernest Davis, a professor at New York University, agrees that a way to benchmark a machine’s common sense could help the field. But he doesn’t think using school tests is a good choice.

Using tests created for children has the advantage of ensuring that researchers don’t accidentally or intentionally make the benchmark for their field too easy, he says. But he argues that because children are so much better at understanding the world than machines, tests written for them can’t be used to probe the abilities most important to making progress with intelligent software.

“What’s difficult for humans is very different from what’s difficult for machines,” says Davis, who also works on giving software common sense. “Standardized tests for humans don’t get very good coverage of the kinds of problems that are hard for computers.”

Davis says a better alternative would be to craft exam-style questions specifically for machines. One example question that he’s created suggests a science exam crafted for machines could look very basic to a fourth grader: “Sally’s favorite cow died yesterday. The cow will probably be alive again a) tomorrow; b) within a week; c) within a year; d) within a few years; e) the cow will never be alive again.”

Etzioni counters that although school test questions don’t directly test very basic common sense, they require it implicitly, because it is needed to interpret the questions. Only by using questions crafted for humans can we really say we’re measuring machines against our own standards, he says. “Putting humans and machines on an equal footing makes sense.”
