Hundreds of millions of people visit Yahoo’s news sites each month. Today the company released a huge trove of information about the news-reading habits of some 20 million of them in an attempt to help researchers invent software that’s better at predicting what we want.
The 13-terabyte data set (13,000 gigabytes) is drawn from activity on Yahoo sites between February and May of last year and is being made available only to academic researchers. Yahoo says it is the largest data set ever made freely available, besting a one-terabyte data set released by the online ad company Criteo last year.
Suju Rajan, director of research for personalization science at Yahoo Labs, says the data provides a valuable testbed on which to train and test algorithms that try to understand what people like based on their past behavior. “This is not just relevant to Yahoo; work on this data set is going to benefit the whole industry,” she said at a news briefing Tuesday.
Recommendation algorithms are crucial to tech companies such as Yahoo, Netflix, Amazon, and Google, which use them to suggest things a person might want to read, watch, or buy. Academics rarely get a chance to work with data on people’s real behavior at the scale available to corporate data scientists, but they are much freer to explore new ideas that could offer major improvements, says Rajan.
The newly released data includes the headlines that Yahoo’s personalization algorithms showed to people, a summary of the content of the articles, and which articles people clicked on. The logs for some seven million of Yahoo’s users include basic demographic information such as age, gender, and location.
Kristian Hammond, a professor at Northwestern University and chief scientist at Narrative Science, welcomed the release. “If the data are good, then I think there’s a tremendous benefit to having this,” he says.
Hammond notes that Yahoo’s data dump provides a useful counterpoint to Google’s recent release of a software package it uses for large-scale machine learning (see “Here’s What Developers Are Doing With Google’s AI Brain”). “Most people don’t have giant data sets like that package is designed for,” he says. As well as helping to improve recommendation algorithms, Yahoo’s data could reveal patterns in the interests of different demographics, says Hammond.
Hammond also cautions that releasing information about people’s online activity does not come without risks. AOL accidentally exposed the identities and private thoughts of some of its customers in 2006 when it released search logs for 650,000 people without properly scrubbing the data. Rajan says that without names or other identifying information, knowing the news articles an anonymous user clicked on doesn’t pose such a risk. Hammond says that some people will try to identify users anyway.