Amazon’s $150 Million Typo Is a Lightning Rod for a Big Cloud Problem

A botched command inadvertently took down swaths of the Web, but it only serves to reveal that centralized Web services need to be built more robustly.

If you’re going to put all your data in the cloud, you want it to be a well-built cloud. This week, Amazon—the world’s largest provider of such infrastructure—showed that construction skills are still lacking.

On Tuesday, large parts of the Internet simply stopped working. Slack wouldn’t let you communicate with colleagues, Trello wouldn’t let you manage a project, and, sadly, the MIT Technology Review website wouldn’t let you read about emerging technology. There were also complaints about smart-home hardware failing to work properly.

The reason: Amazon’s S3 cloud storage system failed, and the many services that rely on it were unable to function properly as a result. And this wasn’t just a blip: the problem took at least four hours to fix.

It’s hard to accurately quantify the true cost of such an outage. But, according to the Wall Street Journal, analytics firm Cyence has estimated that it cost S&P 500 companies at least $150 million. And the traffic monitoring firm Apica claims that 54 of the top 100 online retailers saw site performance slump by at least 20 percent. So there’s no way around the fact that it was expensive.

That makes the reason it happened all the more embarrassing. In a statement describing what went wrong, Amazon has admitted that the root cause of the outage was an incorrect command executed by a staff member at its Northern Virginia facility during routine maintenance. Sadly, it resulted in a catastrophic cascade of events.

The worker was supposed to take a small number of servers offline, but made a mistake and took out more servers than intended, including two that powered fundamental processes used across the entire system. The mistake essentially wiped out the facility’s ability to process user requests.

Amazon operates multiple cloud regions dotted around the world, and customers of its services are able to store files and run code in more than one of them. But doing so is more expensive and, as the Register notes, even companies that do run their services across several regions found their systems falling over, likely due to capacity issues.

Just four days before the outage, we described the inherent risks of centralized Web services and speculated about the impact that would be felt if Amazon’s cloud service failed. At the time, we warned that “the stakes are high,” arguing that “security, reliability, and competency” are vital—and perhaps underrepresented—for companies that provide centralized Web services.

Amazon appears to agree. It’s already put in place safeguards so that incidents like the one brought about by the ham-fisted staff member can’t shut down as many servers quite as quickly in the future.

That’s a start. But it’s clear at this point that cloud services need extra insurance policies if they’re to be robust. Amazon, for instance, shouldn’t have even been able to wind up in a situation where its entire Northern Virginia facility could fail at once. Instead, the facility should be split into separate subsystems that work independently.

Even then, centralized Web services may still be vulnerable. If a hacker levels a huge attack at a single provider (using a botnet, for instance), they could still force large parts of the Web offline again. But at least it wouldn’t be the result of a typo.

(Read more: Wall Street Journal, the Register, AP, Amazon Web Services, “Centralized Web Services Are Wonderful—Until They Go Wrong,” “10 Breakthrough Technologies: Botnets of Things”)
