Amazon’s $150 Million Typo Is a Lightning Rod for a Big Cloud Problem

A botched command inadvertently took down swaths of the Web, but it only serves to reveal that centralized Web services need to be built more robustly.

If you’re going to put all your data in the cloud, you want it to be a well-built cloud. This week, Amazon—the world’s largest provider of such infrastructure—showed that construction skills are still lacking.

On Tuesday, large parts of the Internet simply stopped working. Slack wouldn’t let you communicate with colleagues, Trello wouldn’t let you manage a project, and, sadly, the MIT Technology Review website wouldn’t let you read about emerging technology. There were also complaints about smart-home hardware failing to work properly.

The reason: Amazon’s S3 cloud storage system failed. Because Amazon is the world’s largest cloud computing provider, many services that rely on it were also unable to function properly. And this wasn’t just a blip: the problem took at least four hours to fix.

It’s hard to accurately quantify the true cost of such an outage. But, according to the Wall Street Journal, analytics firm Cyence has estimated that it cost S&P 500 companies at least $150 million. And the traffic monitoring firm Apica claims that 54 of the top 100 online retailers saw site performance slump by at least 20 percent. So there’s no way around the fact that it was expensive.

That makes the reason it happened all the more embarrassing. In a statement describing what went wrong, Amazon has admitted that the root cause of the outage was an incorrect command executed by a staff member at its Northern Virginia facility during routine maintenance. Sadly, it resulted in a catastrophic cascade of events.

The worker was supposed to take a small number of servers offline, but entered the command incorrectly and removed more servers than intended, including servers that supported two subsystems on which the rest of the S3 service at that facility depended. The mistake essentially wiped out the facility’s ability to process user requests.

Amazon operates multiple cloud "regions" dotted around the world, and customers of its services are able to store files and run code in more than one of them. But doing so is more expensive and, as the Register notes, even companies that do run their services across several regions found their systems falling over, likely due to capacity issues.

Just four days before the outage, we described the inherent risks of centralized Web services and speculated about the impact that would be felt if Amazon’s cloud service failed. At the time, we warned that “the stakes are high,” arguing that “security, reliability, and competency” are vital—and perhaps underrepresented—for companies that provide centralized Web services.

Amazon appears to agree. It’s already put in place safeguards so that incidents like the one brought about by the ham-fisted staff member can’t shut down as many servers quite as quickly in the future.
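
To make the idea concrete, a safeguard of this kind might look something like the Python sketch below. It is purely illustrative: Amazon's tooling is not public in this detail, and every name here (Subsystem, remove_capacity, min_required, max_per_step) is hypothetical. The sketch simply refuses any request that removes too many servers in one go, or that would leave a subsystem with fewer servers than it needs to keep running.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would break a safety limit."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving requests


def remove_capacity(subsystem: Subsystem, count: int, max_per_step: int = 2) -> int:
    """Take `count` servers offline, or refuse if a limit would be broken.

    Returns the number of servers still active after the removal.
    """
    if count <= 0:
        raise ValueError("count must be a positive number of servers")

    # Limit 1: cap how many servers a single command may remove.
    if count > max_per_step:
        raise UnsafeRemovalError(
            f"refusing to remove {count} servers at once (limit {max_per_step})"
        )

    # Limit 2: never let the subsystem fall below its minimum capacity.
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"{subsystem.name} would drop to {remaining} servers "
            f"(minimum {subsystem.min_required})"
        )

    subsystem.active_servers = remaining
    return remaining
```

With a subsystem of 10 servers and a floor of 8, a request to remove 2 succeeds, while a mistyped request for 20 is rejected before anything is switched off.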

That’s a start. But it’s clear at this point that cloud services need extra insurance policies if they’re to be robust. Amazon, for instance, shouldn’t have been able to wind up in a situation where its entire Northern Virginia facility could fail at once. Instead, the facility should be split into separate subsystems that work independently.
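
As a rough illustration of what "working independently" could mean in practice, the Python sketch below assigns each customer to one of several self-contained slices of a facility. It is an assumption-laden toy, not a description of Amazon's architecture: the Cell class and the cell_for and can_serve functions are hypothetical names invented for this example. The point is only that when one slice fails, customers assigned to the others are unaffected.

```python
import hashlib
from typing import Sequence


class Cell:
    """One independent slice of a facility with its own servers (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True


def cell_for(customer: str, cells: Sequence[Cell]) -> Cell:
    """Pick a cell using a stable hash, so a customer always maps to the same cell."""
    digest = hashlib.sha256(customer.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def can_serve(customer: str, cells: Sequence[Cell]) -> bool:
    """A request succeeds only if the customer's own cell is healthy."""
    return cell_for(customer, cells).healthy
```

If one of four cells is marked unhealthy, only the customers hashed to that cell see failures; the rest carry on as normal.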

Even then, centralized Web services may still be vulnerable. If a hacker levels a huge attack at a single provider, using a botnet for instance, they could still force large parts of the Web offline again. But at least it wouldn’t be the result of a typo.

(Read more: Wall Street Journal, the Register, AP, Amazon Web Services, “Centralized Web Services Are Wonderful—Until They Go Wrong,” “10 Breakthrough Technologies: Botnets of Things”)
