Some things—fog in San Francisco or traffic in New York City—are easy to predict. Others, such as the way a stock market will react to big trades, or the progression of an HIV patient’s illness, are far more complicated. That’s where a startup called Kaggle comes in. It organizes contests in which participants attempt to make seemingly impossible predictions by analyzing mountains of data.
Kaggle corrals thousands of people with backgrounds in data science, including PhDs, graduate students, professors, and people who work at companies such as IBM and Google, offering them the chance to compete to solve companies’ big-data conundrums in exchange for cash. Users take data provided by contest sponsors and compete using custom-made algorithms to find patterns and make the most accurate predictions. You might think of it as a predictive-modeling death match.
Created by Australian economist Anthony Goldbloom, Kaggle was inspired partly by a competition Netflix held from 2006 to 2009. The company offered $1 million to the team that could improve the accuracy of its movie-recommendation software by 10 percent.
The popularity of the Netflix competition showed Goldbloom how many people were interested in working on companies’ data problems. His 2008 internship at The Economist exposed him to plenty of companies that had data worth mining for valuable insights but lacked the right people to study it.
He bet there was room for a company that would bring these two sides together, and figured that giving it a competitive twist would produce better results.
He was on to something. Since launching in April 2010 with a prize of $1,000 for the team that could most accurately predict how countries would vote in the annual Eurovision Song Contest, Kaggle has run 30 different competitions, five of which are still in progress.
And Kaggle’s community, which has grown to about 27,000 people, is getting results. In one early challenge, a Drexel University academic provided anonymized HIV patient records containing genetic marker data that he hoped could be used to predict the progression of the virus. Within a week and a half, Kaggle users could predict that progression with 70 percent accuracy when their predictions were checked against known data, a milestone that academic research had reached only after four years of effort. By the end of the three-month competition, site users had created a model that reduced the previous error rate by about a third and increased the accuracy of predictions to 77 percent.
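For readers curious what checking predictions against known data can look like in practice, here is a minimal sketch in Python. It is an illustration only, not Kaggle's actual scoring code: the function names, team names, and outcome values are all made up, and it assumes the simplest possible measure, the share of predictions that match the known outcome.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one known outcome")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def rank_submissions(submissions, actual):
    """Return (name, accuracy) pairs sorted from most to least accurate."""
    scored = [(name, accuracy(preds, actual)) for name, preds in submissions.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical held-out outcomes (1 = illness progressed, 0 = it did not).
known = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

# Hypothetical entries from two invented teams.
submissions = {
    "team_a": [1, 0, 1, 0, 0, 0, 1, 1, 1, 0],  # 7 of 10 correct
    "team_b": [1, 0, 1, 1, 0, 1, 1, 0, 0, 1],  # 8 of 10 correct
}

leaderboard = rank_submissions(submissions, known)
for name, score in leaderboard:
    print(f"{name}: {score:.0%} accurate")
```

In this toy example the entry with more matches lands at the top of the list. A real contest would likely use a measure chosen by the sponsor and keep the known outcomes hidden from participants.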