Some things—fog in San Francisco or traffic in New York City—are easy to predict. Others, such as the way a stock market will react to big trades, or the progression of an HIV patient’s illness, are far more complicated. That’s where a startup called Kaggle comes in. It organizes contests in which participants attempt to make seemingly impossible predictions by analyzing mountains of data.
Kaggle corrals thousands of people with backgrounds in data science, including PhDs, graduate students, professors, and people who work at companies such as IBM and Google, offering them the chance to compete to solve companies’ big-data conundrums in exchange for cash. Users take data provided by contest sponsors and compete using custom-made algorithms to find patterns and make the most accurate predictions. You might think of it as a predictive-modeling death match.
Created by Australian economist Anthony Goldbloom, Kaggle was inspired partly by a competition Netflix held from 2006 to 2009. The company offered $1 million to the team that could improve the accuracy of its movie-recommendation software by 10 percent.
The popularity of the Netflix competition showed Goldbloom how many people were interested in working on companies’ data-related conundrums. His 2008 internship at The Economist exposed him to plenty of companies that had data that could be mined for valuable insights but lacked the right people to study it.
He bet there was room for a company that would bring these two sides together, and figured that giving it a competitive twist would provide better results.
He was on to something. Since launching in April 2010 with a prize of $1,000 for the team that could most accurately predict how countries would vote in the annual Eurovision Song Contest, Kaggle has run 30 different competitions, five of which are still in progress.
And Kaggle’s community, which has grown to about 27,000 people, is getting results. In one early challenge, a Drexel University academic provided anonymous HIV patient records containing genetic marker data that he hoped could be used to predict the progression of the virus. Within a week and a half, Kaggle users could predict the progression of the virus with 70 percent accuracy when their predictions were compared with known data, a milestone that academic research had reached only after four years of effort. By the end of the three-month competition, site users had created a model that reduced the previous error rate by about a third and increased the accuracy of predictions to 77 percent.
Goldbloom says the site’s appeal for competitors is the intoxicating feeling of rising on the leader boards: those who submit the best solutions rise to the top of the board for that competition. “You want to keep climbing the ladder,” Goldbloom says.
Will Cukierski, a biomedical engineering doctoral student at Rutgers University, not only likes climbing the ladder, but also sees the competitions as a way to get a toehold in the job market. He’s participated in about half a dozen Kaggle competitions, winning first place in one and getting near the top in others. “It’s a little bit of fun and a little bit of business,” he says.
Though most of the people working on Kaggle’s competitions have backgrounds in data mining, winners usually come from a different field than the one the competition represents, probably because they’re able to approach the problem from a new angle, Goldbloom says.
Barbara Chow, education director for the William and Flora Hewlett Foundation, is hoping this outside-the-box thinking helps her group’s challenge, which seeks a better way to automatically score student essays. The contest, which offers a $60,000 grand prize and ends April 30, is running concurrently with a private competition that includes major companies already working in the automated essay scoring field.
Though she’s not sure whether Kaggle’s community will come up with the best result, Chow says the Hewlett Foundation decided to experiment with running the challenge because the site has “great access to the right people.”
Cukierski is one of these people—his team is hard at work on the competition, trying to best current automated offerings and create a solution that approaches the grades humans give. How are they doing so far? “Our preliminary results show we’re already pretty close to the humans,” he says.