Hundreds of millions of people visit Yahoo’s news sites each month. Today the company released a huge trove of information about the news-reading habits of some 20 million of them in an attempt to help researchers invent software that’s better at predicting what we want.
The 13-terabyte data set (13,000 gigabytes) is drawn from activity on Yahoo sites between February and May of last year and is being made available only to academic researchers. Yahoo says it is the largest data set ever to be made freely available, besting a one-terabyte data set released by the online ad company Criteo last year.
Suju Rajan, director of research for personalization science at Yahoo Labs, says the data provides a valuable testbed on which to train and test algorithms that try to understand what people like based on their past behavior. “This is not just relevant to Yahoo; work on this data set is going to benefit the whole industry,” she said at a news briefing Tuesday.
Recommendation algorithms are crucial to tech companies such as Yahoo, Netflix, Amazon, and Google, which use them to suggest things a person might want to read, watch, or buy. Academics rarely get to work with data on people’s real behavior at the scale available to corporate data scientists, but they are much freer to explore new ideas that could offer major improvements, says Rajan.
The newly released data includes the headlines that Yahoo’s personalization algorithms showed to people, a summary of the content of the articles, and which articles people clicked on. The logs for some seven million of Yahoo’s users include basic demographic information such as age, gender, and location.
Kristian Hammond, a professor at Northwestern University and chief scientist at Narrative Science, welcomed the release. “If the data are good, then I think there’s a tremendous benefit to having this,” he says.
Hammond notes that Yahoo’s data dump provides a useful counterpoint to Google’s recent release of a software package it uses for large-scale machine learning (see “Here’s What Developers Are Doing With Google’s AI Brain”). “Most people don’t have giant data sets like that package is designed for,” he says. Beyond improving recommendation algorithms, Yahoo’s data could reveal patterns in the interests of different demographic groups, says Hammond.
Hammond also notes that releasing information about people’s online activity is not without risks. AOL accidentally exposed the identities and private thoughts of some of its customers in 2006 when it released search logs for 650,000 people without properly scrubbing the data. Rajan says that without names or other identifying information, knowing which news articles an anonymous user clicked on doesn’t pose such a risk. Hammond says that some people will try to identify users anyway.