The reams of data that many modern businesses collect, dubbed “big data,” can provide powerful insights. Such data is the key to Netflix’s recommendation engines, Facebook’s social ads, and even Amazon’s methods for speeding up Silk, the new Web browser that comes with its new Fire tablet.
But big data is like any powerful tool. Using it carelessly can have dangerous results.
A new paper presented at a recent Symposium on the Dynamics of the Internet and Society spells out the reasons that businesses and academics should proceed with caution. While privacy invasions—both deliberate and accidental—are obvious issues, the paper also warns that data can easily be incomplete and distorted.
“With big data comes big responsibilities,” says Kate Crawford, an associate professor at the University of New South Wales, who was involved with the work. “There’s been the emergence of a philosophy that big data is all you need,” she adds. “We would suggest that, actually, numbers don’t speak for themselves.”
Crawford’s paper, written with Microsoft senior researcher Danah Boyd, illustrates the ways that big data sets can fall down, particularly when used to make claims about people’s behavior. “Big data sets are never complete,” Crawford says. For example, researchers often study Facebook to analyze people’s social relationships, using connections made through the social network as a stand-in for real-world ties. But it’s common for Facebook to show a distorted picture of people’s closest social relationships, such as with parents, live-in romantic partners, or friends seen daily. “Facebook is not the world,” Crawford says.
Google is a poster child for the power of data. The company has transformed a massive amount of information, gathered through its search engine, into a commanding ad network and a powerful role as the gatekeeper of much of the world’s information.
At a conference on Knowledge Discovery and Data Mining in August, I watched Google’s director of research, Peter Norvig, demonstrate the true power of a large data set, using the example of machine translation. Norvig showed that training algorithms on very large data sets, like those Google has collected from the many Web pages it crawls that are available in multiple languages, can produce dramatic results. With enough data, Norvig said, even the worst algorithm performs far better than what can be achieved with a smaller data set.