Automating the Data Scientists
Software that can discover patterns in data and write a report on its findings could make it easier for companies to analyze their data.
Many organizations have more data than they’re able to interpret.
Whether your business is fighting cancer, serving online ads, or governing a country, employees who can dissect and explain complex data have become indispensable.
Now researchers backed by Google are developing software that could automate some of the work performed by such data scientists, in hopes of making sophisticated data skills more widely available. When fed raw data, the “automatic statistician” software spits out a report that uses words and charts to describe the mathematical trends it finds.
“It’s not meant to replace exactly what a statistician would do, but it can help a lot,” says Zoubin Ghahramani, professor of information engineering at the University of Cambridge, who developed the software. “Sometimes it finds patterns that a regular data analyst would not,” he adds.
Computers have made it trivial to run complex mathematical operations on large collections of data, and selling data analysis software is a growing business. But human creativity and expertise are still needed to choose and deploy the methods that can explain the patterns in a data set.
The automatic statistician is one of a handful of tools being built to automate some of that expertise. When the system was given a decade of data on air travel, for example, it produced a nine-page report with four mathematical explanations for trends seen in the data that could be used to produce forecasts.
Ghahramani recently received a $750,000 grant from Google to support the project. Later this year, a version of the automatic statistician will be made available online. After that, Ghahramani says, he’ll explore the possibility of launching a commercial version, while also continuing his research.
The automatic statistician draws on a large collection of statistical techniques that can be combined like building blocks to create different mathematical models, says Ghahramani. The software first tries out the simplest of those methods on the data; it then selects the ones that best explain the data for another round of experimentation, adding more mathematical techniques to see what happens. The best model is then used to generate the final written report.
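The search loop described above can be illustrated with a small sketch. The article does not say which statistical techniques or scoring rule the software uses, so everything here is an assumption made for illustration: the building blocks are simple basis functions (the hypothetical names in `BLOCKS`), candidate models are fitted by ordinary least squares, and models are compared with the Bayesian information criterion (BIC), which rewards a good fit and penalizes extra blocks. It is not the researchers' implementation.

```python
import math

# Hypothetical building blocks (an assumption, not taken from the software):
# each maps a month index t to one feature value.
BLOCKS = {
    "constant": lambda t: 1.0,
    "linear": lambda t: float(t),
    "quadratic": lambda t: float(t * t),
    "yearly_sin": lambda t: math.sin(2 * math.pi * t / 12),
    "yearly_cos": lambda t: math.cos(2 * math.pi * t / 12),
}


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(names, ts, ys):
    """Least-squares fit of a sum of blocks; returns (weights, BIC score)."""
    cols = [[BLOCKS[name](t) for t in ts] for name in names]
    k, n = len(names), len(ys)
    gram = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(u * y for u, y in zip(cols[i], ys)) for i in range(k)]
    w = solve(gram, rhs)
    rss = sum((y - sum(w[i] * cols[i][j] for i in range(k))) ** 2
              for j, y in enumerate(ys))
    # Lower BIC is better; the small floor avoids log(0) on a perfect fit.
    bic = n * math.log(rss / n + 1e-12) + k * math.log(n)
    return w, bic


def greedy_search(ts, ys):
    """Add one block per round, keeping the best model, until nothing helps."""
    chosen, best = [], float("inf")
    while True:
        candidates = [(fit(chosen + [name], ts, ys)[1], name)
                      for name in BLOCKS if name not in chosen]
        if not candidates:
            break
        score, name = min(candidates)
        if score >= best:  # no extension improves the score, so stop
            break
        chosen.append(name)
        best = score
    return chosen, best


# Synthetic four-year monthly series: offset, upward trend, yearly cycle.
ts = list(range(48))
ys = [2.0 + 0.5 * t + 3.0 * math.sin(2 * math.pi * t / 12) for t in ts]
chosen, score = greedy_search(ts, ys)
print(sorted(chosen))
```

On this synthetic series the search should settle on the constant, linear, and yearly sine blocks and then stop, because adding a fourth block raises the complexity penalty without improving the fit. A report generator would then describe each selected block in words, which is the step this sketch leaves out.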
The reports focus strictly on the data, not on what’s going on in the real world. For example, although the automatic statistician came up with a way to mathematically describe a regular surge in airline activity seen every summer, it didn’t suggest that vacation travel might be the cause. However, Ghahramani says, this still provides a useful starting point for human data analysts, who could provide such interpretations or further analysis.
A report by the U.K.’s Royal Statistical Society last year warned of a “crunch” in the supply of data scientists, as demand for data skills grows across all kinds of industries. LinkedIn has reported that members of its service claiming statistical skills were the most likely to get a new job or attract the interest of recruiters in 2014.
If the automatic statistician does turn into a commercial product, it will join a crowded field of services aiming to help companies make more of their data.
A company called Skytree this week launched what it claims is the first commercial tool that can automatically select the best model to explain a particular data set. Unlike the automatic statistician, that “automodeler” can’t produce written reports of its work. Skytree’s customers include insurers and credit card companies that use the service to catch instances of fraud.
Skytree’s chief scientist, Alex Gray, also an associate professor at Georgia Tech, says the automatic statistician is an interesting research project but its methods aren’t efficient enough to handle very large data sets.
Another company, Narrative Science, offers a service that turns numerical data into readable reports (see “Robot Journalist Finds New Work on Wall Street”). Cofounder Kristian Hammond, who is also a professor at Northwestern University, says the automatic statistician could help data scientists be more efficient. But its reports would offer little to those who are unfamiliar with statistics, he says. Most business people don’t want to know about mathematical models, Hammond says: “They want to know that they could save money by reducing factory activity by 50 percent between the hours of 1 a.m. and 6 a.m.”