Embracing the Data
The feeble computers of the past were not up to this job, though. “Years ago,” Abernathy says, “we would see a shift in the data, and we would try to handle it using three or four or five parameters.” Engineers would make educated guesses as to which were the most important variables, say the train’s speed and the ambient temperature, and then perform a regression analysis (a standard technique that teases out the effects of different factors on a variable). But no matter how carefully the parameters were chosen or the calculations performed, the model’s predictions were inevitably impaired by the limitation on how much the computer could handle.
Today’s technology has removed that limitation. Twenty years ago, statisticians spent a great deal of effort finding clever ways to limit data and still get reasonable answers. Now, empowered with faster number-crunching machines, they embrace the data. “We can look at tens or hundreds of parameters, and we can determine relationships that we could never see before,” Abernathy says. Picking out subtle relationships between variables in operating conditions and in system performance is important, because those relationships, if they are not accounted for, can cause the model to spit out inaccurate results.
Here is where the third key advance, in statistics, comes into play. Statisticians have developed a number of analytical tools that take advantage of this increased computing power to create more accurate projections than are possible with classical regression analysis. Among the most important is a technique using predictive models called “decision trees,” or, more particularly, “classification and regression trees.” This method is well suited for such tasks as predicting whether a locomotive engine will fail on a given outing. It does not assume, as regression analysis does, that the relationship between the input variables (age, distance traveled, operating temperature, oil pressure and so on) and the output variable (whether the machine fails or not) is a matter of simple extrapolation. Given enough data, a decision tree can model virtually any relationship, no matter how complex. It can also handle incomplete data, such as readings made when a sensor is malfunctioning, much more easily than regression analysis can. And whereas regression analysis generally reaches a point of diminishing returns, past which gathering more data will not improve the predictions, that’s not the case with decision trees. With the new method, “more data is always better,” says Jerome Friedman of Stanford University, one of its developers.
The decision tree works, Friedman explains, by dividing a set of data into smaller and smaller partitions until it reaches a best partition for predicting a particular outcome. The data might, for example, consist of thousands of sets of readings made on hundreds of locomotive engines. The outcome in question might be whether a given engine will run smoothly for another 5,000 kilometers. The best indicator of whether an engine will fail might be whether its current operating temperature is above or below a certain level, or perhaps whether it has covered more or less than a certain distance since its last major overhaul. The decision tree begins by slicing the data into the two subsets that best correlate with the two divergent outcomes. Each of the resulting two subsets of data points is then split by the same best-prediction criterion; the resulting four subsets are divided; and so on, until further divisions do not improve the predictive value.
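The splitting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method GE or Friedman actually uses: the sensor readings are invented, the function names (gini, best_split, build_tree, predict) are hypothetical, and Gini impurity is assumed as the measure of how well a split separates failures from smooth runs. Splitting stops when no further division lowers the impurity or when an assumed depth limit is reached.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the subset is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold, impurity) for the best two-way split,
    or None when no split improves on the unsplit subset."""
    n = len(rows)
    parent = gini(labels)
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds lie midway between neighboring distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < parent and (best is None or score < best[2]):
                best = (f, t, score)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively partition the data; leaves hold the majority outcome."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the splits from the root down to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Invented readings: (operating temperature in deg C, km since last overhaul).
rows = [(85, 10000), (90, 40000), (88, 120000), (110, 20000),
        (115, 90000), (120, 150000), (92, 160000), (87, 30000)]
labels = ["ok", "ok", "fail", "fail", "fail", "fail", "fail", "ok"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (86, 20000)))
```

Each dictionary in the printed tree corresponds to one slice of the data into two subsets, and each string is a leaf where further division would not help. A production tool would add pruning, handling of missing readings and validation on held-out data, none of which this sketch attempts.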
Thanks to such statistical tools, people like GE’s Abernathy can wring an astonishing amount of information from the signals trickling out of the sensors on technological devices. Jet engines cost between $5 million and $10 million and need an overhaul every three to five years. Because this procedure is also expensive, costing $500,000 to $2 million, airlines strive to maximize the time between one overhaul and the next. An equally important goal is to avoid having too many engines due for major servicing at the same time: airlines have only so many spares on hand.