Watson on Jeopardy, Part 2

The IBM machine’s mistakes offered insights about how it works.

Henry Lieberman

February 15, 2011

Watching the first night of the Jeopardy match pitting the IBM Watson program against human contestants was great fun. One nice touch was the “backstage” display that showed the three answers Watson considered for each question and the machine’s confidence in each. That display offers some insight into the range of candidates the program was weighing.

Some of the categories were obviously softballs for Watson. One category, “Beatles People,” was easy because simply matching song lyrics would get the program a long way (but not all the way) to finding the answer. The rules of the game prohibited the computer from going out on the Web to find answers, so Watson had to rely on its own resources, stored in advance. But in its 15 petabytes of storage, Watson has a copy of a good swath of the Web.

Obviously, it had a copy of the Beatles lyrics that it was searching. Otherwise it wouldn’t have had a prayer on those questions.

Watson ended the first round tied for first, with $5,000; Ken Jennings was third with $2,000. But to get an idea of how well Watson really did, you can run your own contest at home against Watson’s real competitor: not Brad Rutter or Ken Jennings, but a search engine like Google. Simply type the clue into Google and see what you get. Like Watson, Google analyzes huge quantities of text, counting words and keeping track of how often words tend to occur together. Like Watson, Google uses multiple approaches to analyze text, and then has a kind of “voting” scheme to figure out how confident it is of the answer.
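To make the idea of counting co-occurring words and then "voting" concrete, here is a minimal sketch in Python. It is a toy illustration only, not how Watson or Google actually works: the function names (`cooccurrence_counts`, `cooccurrence_score`, `vote`) and the averaging rule are assumptions made up for this example.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cooccurrence_counts(documents):
    """Count how many documents each unordered pair of words shares."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(tokenize(doc)))
        # combinations() over a sorted list yields each pair in sorted order
        pairs.update(combinations(words, 2))
    return pairs


def cooccurrence_score(clue, candidate, pairs):
    """Add up how often the candidate word appears alongside each clue word."""
    total = 0
    for word in set(tokenize(clue)):
        if word == candidate:
            continue
        total += pairs[tuple(sorted((word, candidate)))]
    return total


def vote(candidates, scorers):
    """Combine several scorers into one confidence per candidate.

    Each scorer's raw scores are normalized to sum to 1, then averaged.
    This averaging rule is a hypothetical stand-in for a real voting scheme.
    """
    confidence = {c: 0.0 for c in candidates}
    for scorer in scorers:
        raw = {c: scorer(c) for c in candidates}
        norm = sum(raw.values())
        if norm == 0:
            continue  # this scorer has no opinion on any candidate
        for c in candidates:
            confidence[c] += raw[c] / norm / len(scorers)
    return sorted(confidence.items(), key=lambda kv: kv[1], reverse=True)
```

Each scorer here is just a function from a candidate answer to a number, so a second or third "approach" (say, how many documents mention the candidate at all) can be passed to `vote` alongside the co-occurrence score. A real system would weight the scorers rather than average them equally.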

There are many differences between Watson and Google, but running this experiment will give you a good feel for the problem. A lot of the time, what you will get is some Web pages that have the answer somewhere within them, but picking the answer out of whatever is on the page, ads and all, is no mean feat. Understanding what constitutes an answer is the central problem.

Interestingly, Watson’s failures were sometimes more instructive than its successes.

Clue: It was this anatomical oddity of US gymnast George Eyser….

Ken Jennings’ answer: Missing a hand (wrong)

Watson’s answer: leg (wrong)

Correct answer: Missing a leg

What Watson failed to realize was that the word “leg,” by itself, wasn’t actually an answer to the question. This is common sense for people, because “leg” is an anatomical part, not an anatomical oddity, though Watson did realize that legs were involved somehow. What happened here might have been something more profound than a simple bug. David Ferrucci, Watson’s project leader, attributed the failure to the difficulty of the word “oddity” in the question. To understand what might be odd, you have to compare it to what isn’t odd—that is to say, what’s common sense. A problem with Watson’s approach is that if some sentence appears in its database, it can’t tell whether someone put it there just because it’s true, or because someone felt it was so unusual that it needed to be said.
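The distinction between an anatomical part and an anatomical oddity can be sketched as a check on the type of a candidate answer. The Python below is a deliberately naive illustration, not a description of Watson's internals: the word lists `BODY_PARTS` and `ODDITY_MARKERS` and the functions `answer_type` and `fits_clue` are hypothetical and invented for this example.

```python
# Hypothetical toy lexicons, hand-written for illustration only.
BODY_PARTS = {"leg", "hand", "arm", "foot"}
ODDITY_MARKERS = {"missing", "extra", "wooden", "prosthetic"}


def answer_type(candidate):
    """Guess whether a candidate names a body part or an oddity about one."""
    words = candidate.lower().split()
    has_part = any(w in BODY_PARTS for w in words)
    has_marker = any(w in ODDITY_MARKERS for w in words)
    if has_part and has_marker:
        return "anatomical oddity"
    if has_part:
        return "anatomical part"
    return "unknown"


def fits_clue(candidate, expected_type):
    """Accept a candidate only if its guessed type matches what the clue asks for."""
    return answer_type(candidate) == expected_type
```

Under these assumptions, `fits_clue("leg", "anatomical oddity")` rejects the bare body part, while `fits_clue("Missing a leg", "anatomical oddity")` accepts the full answer. Note what the sketch leaves unsolved: the list of oddity markers is itself hand-coded common sense, which is exactly the knowledge that cannot be read off word counts alone.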

A computer that lacks common sense, unfortunately, isn’t an oddity. Maybe it should be.

Henry Lieberman is a research scientist who works on artificial intelligence at the Media Laboratory at MIT.
