Giant Yahoo Data Dump Aims to Help Computers Know What You Want

The release of data on the news-reading habits of 20 million Yahoo users could help advance recommendation algorithms.
January 14, 2016

Hundreds of millions of people visit Yahoo’s news sites each month. Today the company released a huge trove of information about the news-reading habits of some 20 million of them in an attempt to help researchers invent software that’s better at predicting what we want.

The giant 13-terabyte data set (13,000 gigabytes) is drawn from activity on Yahoo sites between February and May of last year and is being made available only to academic researchers. Yahoo says the data set is the largest ever to be made freely available, besting a one-terabyte data set released by the online ad company Criteo last year.

Suju Rajan, director of research for personalization science at Yahoo Labs, says the data provides a valuable testbed on which to train and test algorithms that try to understand what people like based on their past behavior. “This is not just relevant to Yahoo; work on this data set is going to benefit the whole industry,” she said at a news briefing Tuesday.

Recommendation algorithms are crucial to tech companies such as Yahoo, Netflix, Amazon, and Google, which use them to suggest things a person might want to read, watch, or buy. Academics rarely get a chance to work with data on people’s real behavior at the scale available to corporate data scientists, but they are much freer to explore new ideas that could offer major improvements, says Rajan.

The newly released data includes the headlines that Yahoo’s personalization algorithms showed to people, a summary of the content of the articles, and which articles people clicked on. The logs for some seven million of Yahoo’s users include basic demographic information such as age, gender, and location.

Kristian Hammond, a professor at Northwestern University and chief scientist at Narrative Science, welcomed the release. “If the data are good, then I think there’s a tremendous benefit to having this,” he says.

Hammond notes that Yahoo’s data dump provides a useful counterpoint to Google’s recent release of a software package it uses for large-scale machine learning (see “Here’s What Developers Are Doing With Google’s AI Brain”). “Most people don’t have giant data sets like that package is designed for,” he says. Beyond helping to improve recommendation algorithms, Yahoo’s data could reveal patterns in the interests of different demographic groups, says Hammond.

Hammond notes that releasing information about people’s online activity does not come without risks. AOL accidentally exposed the identities and private thoughts of some of its customers in 2006 when it released search logs for 650,000 people without properly scrubbing the data. Rajan says that without names or other identifying information, knowing the news articles an anonymous user clicked on doesn’t pose such a risk. Hammond says that some people will try to identify users anyway.
