Few ideas hold more sway among entrepreneurs and investors these days than “Big Data.” The idea is that we are now collecting so much information about people from their online behavior and, especially, through their mobile phones that we can make increasingly specific predictions about how they will behave and what they will buy.
But are those assumptions really true? One doubter is Peter Fader, codirector of the Wharton Customer Analytics Initiative at the University of Pennsylvania, where he is also a professor of marketing. Fader shared some of his concerns in an interview with reporter Lee Gomes.
TR: How would you describe the prevailing idea about Big Data inside the tech community?
Fader: “More is better.” If you can give me more data about a customer—if you can capture more aspects of their behavior, their connections with others, their interests, and so on—then I can pin down exactly what this person is all about. I can anticipate what they will buy, and when, and for how much, and through what channel.
So what exactly is wrong with that?
It reminds me a lot of what was going on 15 years ago with CRM (customer relationship management). Back then, the idea was “Wow, we can start collecting all these different transactions and data, and then, boy, think of all the predictions we will be able to make.” But ask anyone today what comes to mind when you say “CRM,” and you’ll hear “frustration,” “disaster,” “expensive,” and “out of control.” It turned out to be a great big IT wild-goose chase. And I’m afraid we’re heading down the same road with Big Data.
There seem to be a lot of businesses these days that promise to take a Twitter stream or a collection of Facebook comments and then make some prediction: about a stock price, about how a product will be received in the market.
That is all ridiculous. If you can get me a really granular view of data—for example, an individual’s tweets and then that same individual’s transactions, so I can see how they are interacting with each other—that’s a whole other story. But that isn’t what is happening. People are focusing on sexy social-media stuff and pushing it much further than they should be.
Some say the data fetish you’re describing is especially epidemic with the many startups connected with mobile computing. Do you think that’s true? And if so, wouldn’t it suggest that a year or two from now, there are going to be a lot of disappointed entrepreneurs and VCs?
There is a “data fetish” with every new trackable technology, from e-mail and Web browsing in the ’90s all the way through mobile communications and geolocation services today. Too many people think that mobile is a “whole new world,” offering stunning insights into behaviors that were inconceivable before. But many of the basic patterns are surprisingly consistent across these platforms. That doesn’t make them uninteresting or unimportant. But the basic methods we can use in the mobile world to understand and forecast these behaviors (and thus the key data needed to accomplish these tasks) are not nearly as radical as many people suspect.
But doesn’t mobile computing provide some forms of data that would be especially helpful, like your location—the fact that at a given moment, you might be shopping in a store? Information like that would seem to be quite valuable.
Absolutely. I’m not a total data Luddite. There’s no question that new technologies will provide all kinds of genuinely useful measures that were previously unattainable. The key question is: Just how much of that data do we really need? For instance, do we need a second-by-second log of the shopper’s location? Would it be truly helpful to integrate this series of observations with other behavioral data (e.g., which products the shopper examined)? Or would this just be nice to know? And how much of this data should we save after the trip is completed?
A true data scientist would have a decent sense of how to answer these questions, with an eye toward practical decision-making. But a Big Data zealot might say, “Save it all—you never know when it might come in handy for a future data-mining expedition.” That’s the distinction that separates “old school” and “new school” analysts.
Surely you’re not against machine learning, which has revolutionized fields like language translation, or new database tools like Hadoop?
I make sure my PhD students learn all these emerging technologies, because they are all very important for certain kinds of tasks. Machine learning is very good at classification—putting things in buckets. If I want to know which brand this person is going to buy next, or if this person is going to vote Republican or Democrat, nothing can touch machine learning, and it’s getting better all the time.
The problem is that there are many decisions that aren’t as easily “bucketized”; for instance, questions about “when” as opposed to “which.” Machine learning can break down pretty dramatically in those tasks. It’s important to have a much broader skill set than just machine learning and database management, but many “big data” people don’t know what they don’t know.
You appear to believe that some of the best work in data science was done long ago.
The golden age for behavioral prediction was 40 or 50 years ago, when data were really sparse and companies had to squeeze as much insight as they could from them.
Consider Lester Wunderman, who coined the phrase “direct marketing” in the 1960s. He was doing true data science. He said, “Let’s write down everything we know about this customer: what they bought, what catalogue we sent them, what they paid for it.” It was very hard, because he didn’t have a Hadoop cluster to do it for him.
So what did he discover?
The legacy that he (and other old-school direct marketers) gave us is the still-powerful rubric of RFM: recency, frequency, monetary value.
The “F” and the “M” are obvious. You didn’t need any science for that. The “R” part is the most interesting, because it wasn’t obvious that recency, or the time of the last transaction, should even belong in the triumvirate of key measures, much less be first on the list. But it was discovered that customers who did stuff recently, even if they didn’t do a lot, were more valuable than customers who hadn’t been around for a while. That was a big surprise.
Some of those old models are really phenomenal, even today. Ask anyone in direct marketing about RFM, and they’ll say, “Tell me something I don’t know.” But ask anyone in e-commerce, and they probably won’t know what you’re talking about. Or they will use a lot of Big Data and end up rediscovering the RFM wheel—and that wheel might not run quite as smoothly as the original one.
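For readers unfamiliar with RFM, the three measures can be computed from nothing more than a transaction log. The following is a minimal sketch, not a reconstruction of any particular direct marketer's method. The function name `rfm_table`, the tuple layout, the sample customers, and the choice to treat "monetary value" as total spend (some practitioners use the average order instead) are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, as_of):
    """Summarize each customer by recency, frequency, and monetary value.

    transactions: iterable of (customer_id, purchase_date, amount) tuples.
    as_of: the date from which recency is measured.
    """
    grouped = defaultdict(list)
    for customer_id, purchase_date, amount in transactions:
        grouped[customer_id].append((purchase_date, amount))

    table = {}
    for customer_id, purchases in grouped.items():
        last_purchase = max(d for d, _ in purchases)
        table[customer_id] = {
            # Recency: days since the most recent transaction.
            "recency_days": (as_of - last_purchase).days,
            # Frequency: number of transactions on record.
            "frequency": len(purchases),
            # Monetary value: total spend (an assumption; see lead-in).
            "monetary": sum(a for _, a in purchases),
        }
    return table


# Hypothetical transaction log, invented for this example.
transactions = [
    ("alice", date(2012, 1, 5), 40.0),
    ("alice", date(2012, 2, 10), 25.0),
    ("alice", date(2012, 3, 1), 35.0),
    ("bob", date(2012, 4, 20), 15.0),
]

rfm = rfm_table(transactions, as_of=date(2012, 5, 1))
for customer_id, row in sorted(rfm.items()):
    print(customer_id, row)
```

In this made-up log, "bob" has bought less often and spent less than "alice," but he bought far more recently, which is the kind of customer the recency finding described above says should not be overlooked.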
Big Data and data scientists seem to have such a veneer of respectability.
In investing, you have “technical chartists.” They watch [stock] prices bouncing up and down, hitting what is called “resistance” at 30 or “support” at 20, for example. Chartists are looking at the data without developing fundamental explanations for why those movements are taking place—about the quality of a company’s management, for example.
Among financial academics, chartists tend to be regarded as quacks. But a lot of the Big Data people are exactly like them. They say, “We are just going to stare at the data and look for patterns, and then act on them when we find them.” In short, there is very little real science in what we call “data science,” and that’s a big problem.
Does any industry do it right?
Yes: insurance. Actuaries can say with great confidence what percent of people with your characteristics will live to be 80. But no actuary would ever try to predict when you are going to die. They know exactly where to draw the line.
Even with infinite knowledge of past behavior, we often won’t have enough information to make meaningful predictions about the future. In fact, the more data we have, the more false confidence we will have. Not only won’t our hit rate be perfect, it will be surprisingly low. The important part, as both scientists and businesspeople, is to understand what our limits are and to use the best possible science to fill in the gaps. All the data in the world will never achieve that goal for us.
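The actuarial point, that group-level rates can be estimated tightly while individual outcomes stay uncertain, can be illustrated with a little arithmetic. This is a rough sketch under simple assumptions: each person independently reaches a given age with the same probability `p`, and both `p = 0.6` and the group size of 10,000 are hypothetical numbers chosen for the example, not actuarial figures.

```python
import math


def group_vs_individual(p, n):
    """Compare uncertainty about a group share with error on one person.

    p: assumed probability that any one person experiences the outcome.
    n: number of people in the group.
    """
    # Standard error of the observed share in a group of n people
    # (binomial model with independent, identical individuals).
    standard_error = math.sqrt(p * (1 - p) / n)
    # Error rate of the best yes/no guess about a single person:
    # always guess the likelier outcome and be wrong the rest of the time.
    individual_error_rate = min(p, 1 - p)
    return standard_error, individual_error_rate


se, err = group_vs_individual(p=0.6, n=10_000)
print(f"Group share is uncertain by roughly +/- {se:.4f}")
print(f"Best guess about one individual is wrong {err:.0%} of the time")
```

Under these assumptions the share for the group is pinned down to within about half a percentage point, while the best possible guess about any single person is still wrong four times in ten, and collecting a larger group narrows only the first number.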