TR: How do you quantitatively measure the accuracy of your system?
JB: We trained Cinematch on 100 million ratings and asked it to predict what the other 3 million would be. We compared our predictions with the actual answers. We do that every day. We get about 2 million ratings per day and we track the daily fluctuations of the system. We expect to measure submissions to the contest [the same way]. The actual prize dataset is 103 million ratings, but we only released 100 million of them.
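As a rough illustration of the hold-out test described above, the sketch below scores a set of predictions against withheld ratings. It assumes root-mean-squared error as the accuracy measure, which the interview itself does not name. The function name and the five toy ratings are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one rating to score")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical withheld ratings (1 to 5 stars) and a model's guesses for them.
# In the setup described above, the withheld set would hold about 3 million ratings.
held_out_actual = [4, 3, 5, 2, 1]
held_out_predicted = [3.5, 3.0, 4.0, 2.5, 2.0]

score = rmse(held_out_predicted, held_out_actual)
print(f"RMSE on held-out ratings: {score:.4f}")
```

A lower score means the predictions sit closer to what customers actually rated; a perfect predictor would score 0.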
TR: In order to win the $1 million prize, a new algorithm needs to improve the accuracy of recommendations by 10 percent over Cinematch. You’re also awarding a $50,000 “progress” prize each year for the algorithm that shows the most improvement over the previous year’s best algorithm, by at least 1 percent. What will these percentage improvements mean to a Netflix customer?
JB: If you go to the website and rate 100 movies for us, the red stars shown under each movie are personalized for you. We use these ratings to adjust the prediction away from the average recommendation, according to your taste. A three-percent difference, for instance, might make a difference of one-quarter star. We have millions of people rating millions of DVDs, and that quarter-star difference helps us sort the list. The individual movie recommendation might not get so much better, but, overall, the set of recommended movies is very different. Move a battleship a little bit, and it makes a huge difference.
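The percentage thresholds in the question can be made concrete with a little arithmetic. The sketch below assumes "improvement" means the relative reduction in a prediction-error score compared with a baseline; the interview does not spell out the formula, and the error values used here are hypothetical, not real Cinematch figures.

```python
def percent_improvement(baseline_error, candidate_error):
    """Relative reduction in error versus the baseline, as a percentage."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - candidate_error) / baseline_error

def meets_threshold(baseline_error, candidate_error, threshold_percent):
    """True if the candidate beats the baseline by at least threshold_percent."""
    return percent_improvement(baseline_error, candidate_error) >= threshold_percent

# Hypothetical error scores (lower is better).
baseline = 1.00      # stand-in for the system to beat
small_gain = 0.97    # about a 3 percent improvement
large_gain = 0.88    # about a 12 percent improvement

print(meets_threshold(baseline, small_gain, 1))    # clears a 1 percent bar
print(meets_threshold(baseline, small_gain, 10))   # falls short of a 10 percent bar
print(meets_threshold(baseline, large_gain, 10))   # clears a 10 percent bar
```

For the grand prize the baseline would be Cinematch's score; for the progress prize it would be the previous year's best.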
TR: Why are recommendation systems so hard to improve?
JB: One of the reasons is there are no datasets. Many of the machine-learning applications require fairly substantial datasets that easily have millions of data points. There are lots of different approaches to solving the problem, but they all need large datasets. And as with many datasets, once we’ve applied the techniques to those datasets, there’s no place to go.
TR: So you’re looking for an algorithm that tackles the problem in a completely different way than Cinematch?
JB: Correct. As far as we know, there are many good ideas out in the field. We just can’t test them all. We know that there are people who are really on top of the literature who know the ins and outs of [recommendation systems] and we’d really like to know which ones would be better.
TR: What are some approaches, discussed in the literature, which could work, but haven’t been tested with movie recommendations yet?
JB: It’s hard to say. There was an article in Science a few months ago [July 28, 2006] that used an interesting combination of two types of neural networks [a computational method that sorts data in a way similar to the human brain]. One neural network supervises the machine learning and the other steers that learning. At Netflix, we look at correlations between ratings, and that’s a linear model. Not all knowledge can be represented by a linear combination of features. This particular model in Science uses a nonlinear approach. I think that technique could be quite good.
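To show what "correlations between ratings" as a linear model can look like, here is a minimal sketch of a generic correlation-weighted predictor. It is a textbook-style illustration, not Cinematch's actual method, which the interview does not detail; the movie names, ratings, and function names are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def predict(target, user_ratings, ratings_by_movie):
    """Predict a user's rating of `target` as the target's mean rating plus a
    correlation-weighted sum of the user's deviations on movies already rated.
    The result is a linear function of the user's own ratings."""
    target_column = ratings_by_movie[target]
    target_mean = sum(target_column) / len(target_column)
    weighted_sum, total_weight = 0.0, 0.0
    for movie, rating in user_ratings.items():
        column = ratings_by_movie[movie]
        weight = pearson(column, target_column)
        weighted_sum += weight * (rating - sum(column) / len(column))
        total_weight += abs(weight)
    if total_weight == 0:
        return target_mean  # no usable correlations: fall back to the average
    return target_mean + weighted_sum / total_weight

# Hypothetical ratings from the same four customers for three movies.
ratings_by_movie = {
    "Movie A": [5, 4, 1, 2],
    "Movie B": [4, 5, 2, 1],
    "Movie T": [5, 5, 1, 1],
}
# A new customer has rated A and B; estimate their rating for T.
prediction = predict("Movie T", {"Movie A": 5, "Movie B": 4}, ratings_by_movie)
print(f"Predicted rating: {prediction:.2f}")
```

Because the prediction is only a weighted sum of rating deviations, it cannot capture interactions between movies, which is the limitation a nonlinear model is meant to address.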
TR: Are there any other pressing technical challenges at Netflix that might be solved by offering a prize?
JB: I wouldn’t want to speculate on more contests. Are there other technical challenges? Absolutely. Beyond the systems challenge of keeping the recommendation engines up and running with an increasing customer base, we also have a huge number of challenges within the company, like trying to ship two million discs a day to people. And there are interesting challenges ahead as we get ready for the download world [where people can download movies via the Internet]. The company’s filled with tremendous challenges.