But Crawford and Boyd’s work shows that studying large data sets still requires finesse. Twitter, which is commonly scrutinized for insights about people’s moods, attitudes toward politics, and other aspects of daily life, presents a number of problems, the researchers say. About 40 percent of Twitter’s active users sign in to listen, not to post, which, Crawford and Boyd say, suggests that posts come from a certain type of person rather than from a random sample of users. They also note that few researchers have access to all Twitter posts; most use smaller samples provided by the company. Without better information about how those samples were collected, studies could arrive at skewed results, they argue.
Crawford notes that many big data sets—particularly social data—come from companies that have no obligation to support scientific inquiry. Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies.
The researchers add that big data can also raise serious ethical concerns.
Often, Crawford notes, combining data from different sources can have unexpected consequences for the people involved. For example, other researchers have previously shown that they can identify individuals by using social media data in combination with supposedly anonymized behavioral data provided by companies.
Jennifer Chayes, managing director of Microsoft Research New England, says her lab has had firsthand experience with such problems. The lab wanted to run a contest for researchers to analyze a set of search data, she says, and was going over the data carefully to avoid the sorts of deanonymizing scandals that have followed search data releases in the past. They discovered that people often entered search terms that were personally identifying and embarrassing, such as “Is my wife Jane Doe cheating on me?” The lab nixed the contest. Chayes says, “We began to realize how much we didn’t understand about human behavior around search engines.”
Handling big data sets takes almost impossible care, agrees Alessandro Acquisti, an associate professor at Carnegie Mellon who has studied the unintended information that data sets can reveal. Even public data sets raise questions, such as what to do with information that people post and then subsequently want to delete, he says.
Given the quantity of information now available on the Internet, Crawford argues, researchers need to slow down and think about the methods they use. “[The effect of the availability of big data] did shock a lot of people,” she says. “And it should.”