
Automating the Data Scientists

Software that can discover patterns in data and write a report on its findings could make it easier for companies to analyze their data.
February 13, 2015

Whether your business is fighting cancer, serving online ads, or governing a country, employees who can dissect and explain complex data have become indispensable.

Now researchers backed by Google are developing software that could automate some of the work performed by such data scientists, in hopes of making sophisticated data skills more widely available. When fed raw data, the “automatic statistician” software spits out a report that uses words and charts to describe the mathematical trends it finds.

“It’s not meant to replace exactly what a statistician would do, but it can help a lot,” says Zoubin Ghahramani, professor of information engineering at the University of Cambridge, who developed the software. “Sometimes it finds patterns that a regular data analyst would not,” he adds.

Computers have made it trivial to run complex mathematical operations on large collections of data, and selling data analysis software is a growing business. But human creativity and expertise are still needed to choose and deploy the methods that can explain the patterns in a data set.

The automatic statistician is one of a handful of tools being built to automate some of that expertise. When the system was given a decade of data on air travel, for example, it produced a nine-page report with four mathematical explanations for trends seen in the data that could be used to produce forecasts.

Ghahramani recently received a $750,000 grant from Google to support the project. Later this year, a version of the automatic statistician will be made available online. After that, Ghahramani says, he’ll explore the possibility of launching a commercial version, while also continuing his research.

The automatic statistician draws on a large collection of statistical techniques that can be combined like building blocks to create different mathematical models, says Ghahramani. The software first tries out the simplest of those methods on the data; it then selects the ones that best explain the data for another round of experimentation, adding more mathematical techniques to see what happens. The best model is then used to generate the final written report.
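The search Ghahramani describes can be sketched in a few lines of Python. This is a minimal illustration and not the project's code: the article does not say which techniques or scoring rule the software uses, so the building blocks in BLOCKS, the least-squares fit, the Bayesian information criterion score, and the synthetic "air travel" series are all assumptions made for the example.

```python
import numpy as np

# Hypothetical building blocks: each turns time (in years) into one feature.
BLOCKS = {
    "linear": lambda t: t,
    "quadratic": lambda t: t ** 2,
    "yearly_sin": lambda t: np.sin(2 * np.pi * t),
    "yearly_cos": lambda t: np.cos(2 * np.pi * t),
}

def bic(t, y, names):
    """Fit a constant plus the named blocks by least squares and
    return the Bayesian information criterion (lower is better)."""
    X = np.column_stack([np.ones_like(t)] + [BLOCKS[n](t) for n in names])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    n, k = len(y), X.shape[1]
    return n * np.log(rss / n) + k * np.log(n)

def greedy_search(t, y):
    """Start from the simplest model, then add one block per round,
    keeping the addition that improves the score the most."""
    chosen = []
    best = bic(t, y, chosen)
    while True:
        candidates = [(bic(t, y, chosen + [n]), n)
                      for n in BLOCKS if n not in chosen]
        if not candidates:
            break
        score, name = min(candidates)
        if score >= best:  # no remaining block explains the data better
            break
        chosen.append(name)
        best = score
    return chosen, best

# Synthetic stand-in for a decade of monthly data: a rising trend,
# a yearly cycle, and some noise (assumed values, not real figures).
rng = np.random.default_rng(0)
t = np.arange(120) / 12.0
y = 2.0 * t + 3.0 * np.sin(2 * np.pi * t) + rng.normal(0.0, 0.5, t.size)

chosen, score = greedy_search(t, y)
print("selected blocks:", chosen)
```

The score penalizes every extra block, so the search stops once adding another technique no longer pays for itself. In this toy version the list of selected blocks plays the role of the "best model" that would feed the written report.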

The reports focus strictly on the data, not on what’s going on in the real world. For example, although the automatic statistician came up with a way to mathematically describe a regular surge in airline activity seen every summer, it didn’t suggest that it might be vacation travel. However, Ghahramani says, this still provides a useful starting point for human data analysts who could provide such interpretations or further analysis.

A report by the U.K.’s Royal Statistical Society last year warned of a “crunch” in the supply of data scientists, as demand for data skills grows across all kinds of industries. LinkedIn has reported that members of its service claiming statistical skills were the most likely to get a new job or attract the interest of recruiters in 2014.

If the automatic statistician does turn into a commercial product, it will join a crowded field of services aiming to help companies make more of their data.

A company called Skytree this week launched what it claims is the first commercial tool that can automatically select the best model to explain a particular data set. Unlike the automatic statistician, that “automodeler” can’t produce written reports of its work. Skytree’s customers include insurers and credit card companies that use the service to catch instances of fraud.

Skytree’s chief scientist, Alex Gray, also an associate professor at Georgia Tech, says the automatic statistician is an interesting research project but its methods aren’t efficient enough to handle very large data sets.

Another company, Narrative Science, offers a service that turns numerical data into readable reports (see “Robot Journalist Finds New Work on Wall Street”). Cofounder Kristian Hammond, who is also a professor at Northwestern University, says the automatic statistician could help data scientists be more efficient. But its reports would offer little to those who are unfamiliar with statistics, he says. Most business people don’t want to know about mathematical models, says Hammond: “They want to know that they could save money by reducing factory activity by 50 percent between the hours of 1 a.m. and 6 a.m.”
