If you’re going to put all your data in the cloud, you want it to be a well-built cloud. This week, Amazon—the world’s largest provider of such infrastructure—showed that construction skills are still lacking.
On Tuesday, large parts of the Internet simply stopped working. Slack wouldn’t let people communicate with colleagues, Trello wouldn’t let you manage a project, and, sadly, the MIT Technology Review website wouldn’t let you read about emerging technology. There were also complaints about smart-home hardware failing to work properly.
The reason: Amazon’s S3 cloud storage system failed. Amazon is the world’s largest cloud computing provider, so many services that rely upon it were also unable to function properly. And this wasn’t just a blip: the problem took at least four hours to fix.
It’s hard to accurately quantify the true cost of such an outage. But, according to the Wall Street Journal, analytics firm Cyence has estimated that it cost S&P 500 companies at least $150 million. And the traffic monitoring firm Apica claims that 54 of the top 100 online retailers saw site performance slump by at least 20 percent. So there’s no way around the fact that it was expensive.
That makes the reason it happened all the more embarrassing. In a statement describing what went wrong, Amazon has admitted that the root cause of the outage was an incorrect command executed by a staff member at its Northern Virginia facility during routine maintenance. Sadly, it resulted in a catastrophic cascade of events.
The worker was supposed to take a small number of servers offline, but made a mistake and took more servers out than intended—including two that were used to power fundamental processes used across the entire system. The mistake essentially wiped out the facility’s ability to process user requests.
Amazon operates multiple cloud "areas" dotted around the world, and customers of its services are able to store files and run code on more than one of them. But it’s more expensive and, as the Register notes, even companies that do run their services across a number of the different geographies found their systems falling over, likely due to capacity issues.
Just four days before the outage, we described the inherent risks of centralized Web services and speculated about the impact that would be felt if Amazon’s cloud service failed. At the time, we warned that “the stakes are high,” arguing that “security, reliability, and competency” are vital—and perhaps underrepresented—for companies that provide centralized Web services.
Amazon appears to agree. It’s already put in place safeguards so that incidents like the one brought about by the ham-fisted staff member can’t shut down as many servers quite as quickly in the future.
That’s a start. But it’s clear at this point that cloud services need extra insurance policies if they’re to be robust. Amazon, for instance, shouldn’t have even been able to wind up in a situation where its entire Northern Virginia facility could fail at once—instead, it should be split up into separate sub-systems which work independently.
Even then, centralized Web services may still be vulnerable. If a hacker levels a huge attack at a single provider—using a botnet, for instance—he could still force large parts of the Web offline again. But at least it wouldn’t be the result of a typo.
Toronto wants to kill the smart city forever
The city wants to get right what Sidewalk Labs got so wrong.
Chinese gamers are using a Steam wallpaper app to get porn past the censors
Wallpaper Engine has become a haven for ingenious Chinese users who use it to smuggle adult content as desktop wallpaper. But how long can it last?
Yann LeCun has a bold new vision for the future of AI
One of the godfathers of deep learning pulls together old ideas to sketch out a fresh path for AI, but raises as many questions as he answers.
The US military wants to understand the most important software on Earth
Open-source code runs on every computer on the planet—and keeps America’s critical infrastructure going. DARPA is worried about how well it can be trusted
Get the latest updates from
MIT Technology Review
Discover special offers, top stories, upcoming events, and more.