Bracing for the Data Deluge

Michael Stonebraker helped invent technology that put databases into every business. Now a growing flood of data means he needs to reinvent it.

Tom Simonite

May 19, 2011

From Facebook to the Department of Motor Vehicles, the world is catalogued in databases. No one knows it better than MIT adjunct professor and entrepreneur Michael Stonebraker, who has spent the last 25 years developing the technology that made it so. He got his big break by inventing and commercializing technology that underlies most of the databases, known as relational databases, that rule today. But Stonebraker now happily calls his earlier inventions largely obsolete. He’s working on a new generation of database technology that can handle the flood of digital data that is starting to overwhelm established methods.

“Relational databases are omnipresent as the solution for enterprise data. They have been fabulously successful,” Stonebraker says. But he says that the largest database vendors, including Oracle, IBM, and Microsoft, still sell such products as being appropriate for any business. Stonebraker has a different view: that new database technologies are required to handle the exponential increases in the information that businesses must handle. Stonebraker, 67, is already finding success with several of his own new approaches.

One is a database system called C-Store. Unlike most systems in use today, it stores data on disk column by column, not row by row. That simple tweak required a complete rewrite of how databases have worked, but it dovetails neatly with both the way computer memory works and the way databases are accessed. That yields much faster performance and more compressed data.
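
To make the row-versus-column distinction concrete, here is a minimal sketch in Python. It is not C-Store's actual implementation. The sample table, the field names, and the helper functions `to_columns` and `run_length_encode` are all hypothetical and chosen only for illustration. The sketch shows the two effects described above: a query that needs a single field only has to read that one column, and a column of similar values compresses well (here with simple run-length encoding, one of several possible schemes).

```python
from itertools import groupby

# Hypothetical sample table stored row by row; field names are illustrative only.
rows = [
    {"id": 1, "state": "MA", "amount": 20},
    {"id": 2, "state": "MA", "amount": 35},
    {"id": 3, "state": "MA", "amount": 10},
    {"id": 4, "state": "NY", "amount": 50},
    {"id": 5, "state": "NY", "amount": 5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one field touches only that column,
# not every field of every row.
total = sum(columns["amount"])

# Values within a column tend to resemble each other, so they compress well.
encoded_state = run_length_encode(columns["state"])

print(total)
print(encoded_state)
```

In a real system the columns would live in separate regions of disk, so the saving comes from skipping the reads for fields a query never asks for.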

That tweak and others made by Stonebraker and colleagues at MIT, Brown, Brandeis, Yale, and the University of Massachusetts enabled the launch of Vertica, a company that commercialized C-Store and helped customers to query large databases almost in real time. Vertica was acquired by Hewlett-Packard in February and boasts clients including Comcast, which uses it to monitor the millions of devices that make up its TV and Internet networks, and Groupon, which uses it to analyze the actions of its millions of subscribers.

A related system from Stonebraker and some of the same academic colleagues, H-Store, builds on the same ideas with extra improvements such as running entirely in a computer’s memory, not on disk; this method is particularly useful in online transaction processing. H-Store’s code is open source, but the technology is being commercialized by venture-backed VoltDB, with Stonebraker as CTO. He argues that this kind of use-specific, built-for-speed database system is what most enterprises will need to adopt sooner rather than later to deal with the flood of digital data.

Some organizations are already caught in that flood. Consider Facebook. Already host to more digital photos than any other company, Facebook is building new storage and processing infrastructure as fast as it can. Yet it is pushing the database technology it is using to the limit, splitting its famed social graph across 4,000 databases that must all work together as one, Stonebraker says. “They are just dying under the load of the management layer needed to keep this system up,” he says. “They have the hardest database problem on the planet, and there’s no current system that will meet their needs.”

Solutions that Stonebraker is building for a very different sector already drowning in data may eventually help. A few years ago, he heard of the problems facing the Large Synoptic Survey Telescope under construction in Chile. “It is going to assemble 100 petabytes of raw data and derived data,” says Stonebraker, “and they had no clue what to do with that much.”

Stonebraker and collaborator David DeWitt, affiliated with the University of Wisconsin-Madison, built a unique database system named SciDB. The open-source project now has venture backing and a large community of volunteers from within science. But Stonebraker thinks features of SciDB will eventually find favor beyond academia.

“All science data is uncertain and has error bars, unlike the data in a salary database, so SciDB can pay attention to uncertainty. It also cannot overwrite, because science guys never want to throw anything away,” he says. Those features are not so different from the needs of the high-powered, statistics-heavy analytics or “data science” increasingly at the heart of successful, technology-led businesses. One example is online ad placement: targeting every person individually requires computationally intense analysis to cluster similar people together.

However, Stonebraker doesn’t claim that new database systems like those he is working on can be a panacea for companies suddenly learning the limits of more established technologies. The growing importance of data storage and processing to businesses of all kinds will require them to make both a higher priority. “If you’re running a company, you’ve got to engineer in scale from the beginning,” he says, “because there’s no doubt you will need it later.”
