The New Big Data
Today’s big data is forcing researchers to find new techniques for knowledge discovery and data mining.
Top scientists from companies such as Google and Yahoo are gathered alongside leading academics at the 17th Association for Computing Machinery (ACM) conference on Knowledge Discovery and Data Mining (KDD) in San Diego this week. They will present the latest techniques for wresting insights from the deluge of data now being produced, and for making sense of information that comes in a wider variety of forms than ever before.
Twenty years ago, the only people who cared about so-called “big data”—the only ones who had enormous data sets and the motivation to try to process them—were members of the scientific community, says Usama Fayyad, executive chair of ACM’s Special Interest Group on Knowledge Discovery and Data Mining and former chief data officer at Yahoo. Even then, the results of data mining were impressive. “We were able to solve significant scientific problems that were standing in the field for 30-plus years,” Fayyad says.
The explosive growth of the Internet, however, changed everything. Whether they liked it or not, businesses found themselves operating online and amassing enormous volumes of data about customers and their behavior. As the power of data mining became clear, Fayyad says, so did economic motivations to invest in the field.
Netflix, for example, offered a $1 million prize to any team that could mine its information about users and build a more accurate recommendation system than the one it already had. High-profile examples like this only scratch the surface of the applications for data mining.
“Businesses and industry are increasingly interested in leveraging the data they capture through business processes,” says Chid Apte, director of analytics research at IBM and chair of the conference. In particular, he points to health care, social media, and anything that takes place on the Web.
These days, Internet giants make their money from the information they collect about users and the insights they gain from mining it. Retailers can access complex patterns of shopper behavior to help them stock their stores more profitably. Industry researchers can predict automobile traffic patterns based on congestion, weather, and time of year, and offer the best routes.
Today’s data, however, doesn’t take the familiar form of the database. “The information’s not coming at you in a clean tabular form,” Apte says. “It’s coming at you in a network form.” Often it arrives as a graph, he explains, like the graphs that underlie social media. These graphs often record not only the complex connections between nodes but also other types of information in a variety of formats, such as the videos, images, and comments that people post on social networks.
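To make the contrast with a table concrete, here is a minimal Python sketch of the kind of graph described above: nodes and links that each carry their own mixed attributes. It is only an illustration. The class name AttributedGraph, the helper most_connected, and the sample people and photo are all invented for this example and do not come from Apte or from any particular social network's software.

```python
from collections import defaultdict


class AttributedGraph:
    """Hypothetical undirected graph whose nodes and links carry free-form attributes."""

    def __init__(self):
        self.node_data = {}                # node id -> dict of attributes
        self.neighbors = defaultdict(set)  # node id -> ids of adjacent nodes
        self.link_data = {}                # frozenset({a, b}) -> dict of attributes

    def add_node(self, node_id, **attrs):
        # Create the node if needed, then merge in any new attributes.
        self.node_data.setdefault(node_id, {}).update(attrs)

    def add_link(self, a, b, **attrs):
        # Ensure both endpoints exist, then record the link in both directions.
        self.add_node(a)
        self.add_node(b)
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)
        self.link_data.setdefault(frozenset((a, b)), {}).update(attrs)

    def degree(self, node_id):
        # Number of distinct nodes this node is linked to (0 if unknown).
        return len(self.neighbors.get(node_id, ()))


def most_connected(graph):
    """Return the id of the node with the most links."""
    return max(graph.node_data, key=graph.degree)


# Made-up sample data: three people and one photo, with different link types.
graph = AttributedGraph()
graph.add_node("alice", kind="person")
graph.add_node("bob", kind="person")
graph.add_node("carol", kind="person")
graph.add_node("photo_1", kind="image", caption="Conference hall")
graph.add_link("alice", "bob", relation="friend")
graph.add_link("alice", "carol", relation="friend")
graph.add_link("alice", "photo_1", relation="posted")
graph.add_link("bob", "photo_1", relation="commented", text="Nice shot")

print(most_connected(graph))
```

In a table, every row has the same columns. Here a person, a photo, and a comment can each hold different fields, and the links themselves carry information, which is part of what makes such data harder to mine with tools built for tabular records.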
Social media may have started the trend toward analyzing such graphs, Apte says, but network data comes from other sources as well—for example, from complex engineering systems such as the electric power grid, water distribution systems, and traffic management systems. The distributed sensor networks in these systems produce data sets in which the connections between locations are as important as friendships between individuals in a social network. Understanding such connections is the key to optimizing systems and making them sustainable, Apte says.
People have been working with graphs of data for hundreds of years, but the graphs now being plotted from social networks or sensor networks are of an unprecedented scale, Apte says. “These are gigantic graphs,” he says. “You’re talking about millions of nodes and tens of millions of links.”
Dealing with graphs of that size and scope, and applying modern analytic tools to them, calls for better algorithms and other innovations. Apte says one goal of the conference is to bring cutting-edge techniques from academia and industry research labs to the attention of businesses, so they can apply them more quickly. At the same time, the conference organizers hope, academics will get a sense of the business challenges that most vitally need to be addressed.
Fayyad says that the intense business interest in data has changed the field of data mining. Scientists, he says, mainly dealt with data stored in neat, structured forms. But most of the data that businesses are producing is an unstructured mess.
“While the scientists were getting pretty good at avoiding that stuff, the businesses were being forced to take it head-on,” Fayyad says. “It drove the companies to start developing techniques that no one had ever attempted.”
Challenges certainly remain, but Fayyad says that “people are able to come up with a lot more predictive models, and more importantly score them [to determine how well they work] … It takes analysis to a level that’s truly beyond human brain comprehension.”