But doesn’t mobile computing provide some forms of data that would be especially helpful, like your location—the fact that at a given moment, you might be shopping in a store? Information like that would seem to be quite valuable.
Absolutely. I’m not a total data Luddite. There’s no question that new technologies will provide all kinds of genuinely useful measures that were previously unattainable. The key question is: Just how much of that data do we really need? For instance, do we need a second-by-second log of the shopper’s location? Would it be truly helpful to integrate this series of observations with other behavioral data (e.g., which products the shopper examined)? Or would this just be nice to know? And how much of this data should we save after the trip is completed?
A true data scientist would have a decent sense of how to answer these questions, with an eye toward practical decision-making. But a Big Data zealot might say, “Save it all—you never know when it might come in handy for a future data-mining expedition.” That’s the distinction that separates “old school” and “new school” analysts.
Surely you’re not against machine learning, which has revolutionized fields like language translation, or new database tools like Hadoop?
I make sure my PhD students learn all these emerging technologies, because they are all very important for certain kinds of tasks. Machine learning is very good at classification—putting things in buckets. If I want to know which brand this person is going to buy next, or if this person is going to vote Republican or Democrat, nothing can touch machine learning, and it’s getting better all the time.
The problem is that there are many decisions that aren’t as easily “bucketized”; for instance, questions about “when” as opposed to “which.” Machine learning can break down pretty dramatically in those tasks. It’s important to have a much broader skill set than just machine learning and database management, but many “big data” people don’t know what they don’t know.
You appear to believe that some of the best work in data science was done long ago.
The golden age for predicting behavior was 40 or 50 years ago, when data were really sparse and companies had to squeeze as much insight as they could from them.
Consider Lester Wunderman, who coined the phrase “direct marketing” in the 1960s. He was doing true data science. He said, “Let’s write down everything we know about this customer: what they bought, what catalogue we sent them, what they paid for it.” It was very hard, because he didn’t have a Hadoop cluster to do it for him.
So what did he discover?
The legacy that he (and other old-school direct marketers) gave us is the still-powerful rubric of RFM: recency, frequency, monetary value.
The “F” and the “M” are obvious. You didn’t need any science for that. The “R” part is the most interesting, because it wasn’t obvious that recency, or the time of the last transaction, should even belong in the triumvirate of key measures, much less be first on the list. But it was discovered that customers who did stuff recently, even if they didn’t do a lot, were more valuable than customers who hadn’t been around for a while. That was a big surprise.
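As a rough illustration of the three measures, here is a minimal Python sketch that computes recency, frequency, and monetary value from a list of transactions. The function name `compute_rfm`, the tuple layout, and the sample customers are all hypothetical, invented for this example; it is not how Wunderman or any particular firm implemented RFM, and real systems usually go on to rank or bin these raw values.

```python
from collections import defaultdict
from datetime import date


def compute_rfm(transactions, as_of):
    """Summarize each customer by recency, frequency, and monetary value.

    transactions: iterable of (customer_id, purchase_date, amount) tuples
                  (an assumed layout for this sketch).
    as_of:        the date from which recency is measured.
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)

    for customer_id, purchase_date, amount in transactions:
        # Recency depends only on the most recent purchase date.
        if customer_id not in last_purchase or purchase_date > last_purchase[customer_id]:
            last_purchase[customer_id] = purchase_date
        frequency[customer_id] += 1       # number of transactions
        monetary[customer_id] += amount   # total amount spent

    return {
        cid: {
            "recency_days": (as_of - last_purchase[cid]).days,
            "frequency": frequency[cid],
            "monetary": monetary[cid],
        }
        for cid in last_purchase
    }


# Made-up sample data: one small but recent buyer, one large but lapsed buyer.
transactions = [
    ("alice", date(2024, 1, 5), 40.0),
    ("alice", date(2024, 3, 1), 25.0),
    ("bob", date(2023, 6, 10), 300.0),
]
rfm = compute_rfm(transactions, as_of=date(2024, 3, 11))
```

In this toy data, "bob" has spent far more than "alice," but "alice" bought much more recently. The recency argument above is that a customer like "alice" may well be the more valuable one, which the frequency and monetary totals alone would not reveal.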
Some of those old models are really phenomenal, even today. Ask anyone in direct marketing about RFM, and they’ll say, “Tell me something I don’t know.” But ask anyone in e-commerce, and they probably won’t know what you’re talking about. Or they will use a lot of Big Data and end up rediscovering the RFM wheel—and that wheel might not run quite as smoothly as the original one.
Big Data and data scientists seem to have such a veneer of respectability.
In investing, you have “technical chartists.” They watch [stock] prices bouncing up and down, hitting what is called “resistance” at 30 or “support” at 20, for example. Chartists are looking at the data without developing fundamental explanations for why those movements are taking place—about the quality of a company’s management, for example.
Among financial academics, chartists tend to be regarded as quacks. But a lot of the Big Data people are exactly like them. They say, “We are just going to stare at the data and look for patterns, and then act on them when we find them.” In short, there is very little real science in what we call “data science,” and that’s a big problem.
Does any industry do it right?
Yes: insurance. Actuaries can say with great confidence what percentage of people with your characteristics will live to be 80. But no actuary would ever try to predict when you are going to die. They know exactly where to draw the line.
Even with infinite knowledge of past behavior, we often won’t have enough information to make meaningful predictions about the future. In fact, the more data we have, the more false confidence we will have. Not only won’t our hit rate be perfect, it will be surprisingly low. The important part, as both scientists and businesspeople, is to understand what our limits are and to use the best possible science to fill in the gaps. All the data in the world will never achieve that goal for us.