Giant Yahoo Data Dump Aims to Help Computers Know What You Want

The release of data on the news-reading habits of 20 million Yahoo users could help advance recommendation algorithms.
January 14, 2016

Hundreds of millions of people visit Yahoo’s news sites each month. Today the company released a huge trove of information about the news-reading habits of some 20 million of them in an attempt to help researchers invent software that’s better at predicting what we want.

The 13-terabyte data set (roughly 13,000 gigabytes) is drawn from activity on Yahoo sites between February and May of last year and is being made available only to academic researchers. Yahoo says the data set is the largest ever to be made freely available, besting a one-terabyte data set released by the online ad company Criteo last year.

Suju Rajan, director of research for personalization science at Yahoo Labs, says the data provides a valuable testbed on which to train and test algorithms that try to understand what people like based on their past behavior. “This is not just relevant to Yahoo; work on this data set is going to benefit the whole industry,” she said at a news briefing Tuesday.

Recommendation algorithms are crucial to tech companies such as Yahoo, Netflix, Amazon, and Google, which use them to suggest things a person might want to read, watch, or buy. Academics rarely get to work with data on people’s real behavior at the scale available to corporate data scientists, but they are much freer to explore new ideas that could offer major improvements, says Rajan.

The newly released data includes the headlines that Yahoo’s personalization algorithms showed to people, a summary of the content of the articles, and which articles people clicked on. The logs for some seven million of Yahoo’s users include basic demographic information such as age, gender, and location.

Kristian Hammond, a professor at Northwestern University and chief scientist at Narrative Science, welcomed the release. “If the data are good, then I think there’s a tremendous benefit to having this,” he says.

Hammond notes that Yahoo’s data dump provides a useful counterpoint to Google’s recent release of a software package it uses for large-scale machine learning (see “Here’s What Developers Are Doing With Google’s AI Brain”). “Most people don’t have giant data sets like that package is designed for,” he says. As well as helping to improve recommendation algorithms, Yahoo’s data could reveal patterns in the interests of different demographics, says Hammond.

Hammond notes that releasing information about people’s online activity does not come without risks. AOL accidentally exposed the identities and private thoughts of some of its customers in 2006 when it released search logs for 650,000 people without properly scrubbing the data. Rajan says that without names or other identifying information, knowing the news articles an anonymous user clicked on doesn’t pose such a risk. Hammond says that some people will try to identify users anyway.
