
Giant Yahoo Data Dump Aims to Help Computers Know What You Want

The release of data on the news-reading habits of 20 million Yahoo users could help advance recommendation algorithms.
January 14, 2016

Hundreds of millions of people visit Yahoo’s news sites each month. Today the company released a huge trove of information about the news-reading habits of some 20 million of them in an attempt to help researchers invent software that’s better at predicting what we want.

The giant 13-terabyte data set (13,000 gigabytes) is drawn from activity on Yahoo sites between February and May of last year and is being made available only to academic researchers. Yahoo says the data set is the largest to ever be made freely available, besting a one-terabyte data set released by the online ad company Criteo last year.

Suju Rajan, director of research for personalization science at Yahoo Labs, says the data provides a valuable testbed on which to train and test algorithms that try to understand what people like based on their past behavior. “This is not just relevant to Yahoo; work on this data set is going to benefit the whole industry,” she said at a news briefing Tuesday.

Recommendation algorithms are crucial to tech companies such as Yahoo, Netflix, Amazon, and Google, which use them to suggest things a person might want to read, watch, or buy. Academics rarely get the chance to work with data on people’s real behavior at the scale available to corporate data scientists, but they are much freer to explore new ideas that could offer major improvements, says Rajan.

The newly released data includes the headlines that Yahoo’s personalization algorithms showed to people, a summary of the content of the articles, and which articles people clicked on. The logs for some seven million of Yahoo’s users include basic demographic information such as age, gender, and location.

Kristian Hammond, a professor at Northwestern University and chief scientist at Narrative Science, welcomed the release. “If the data are good, then I think there’s a tremendous benefit to having this,” he says.

Hammond notes that Yahoo’s data dump provides a useful counterpoint to Google’s recent release of a software package it uses for large-scale machine learning (see “Here’s What Developers Are Doing With Google’s AI Brain”). “Most people don’t have giant data sets like that package is designed for,” he says. As well as helping to improve recommendation algorithms, Yahoo’s data could reveal patterns in the interests of different demographic groups, says Hammond.

Hammond notes that releasing information about people’s online activity does not come without risks. AOL accidentally exposed the identities and private thoughts of some of its customers in 2006 when it released search logs for 650,000 people without properly scrubbing the data. Rajan says that without names or other identifying information, knowing which news articles an anonymous user clicked on doesn’t pose such a risk. Hammond says that some people will try to identify users anyway.
