
Were Amazon’s Outages Inevitable?

It doesn’t seem possible to keep services up and running in the face of every possible problem.
August 10, 2011

Amazon is still working to recover from outages this week, particularly problems resulting from a Sunday lightning strike in Dublin that knocked service offline for many European customers. Information Week reports that the company is struggling to restore data to some affected customers:

An error embedded in a piece of Amazon Web Services cleanup software has resulted in some customers having their backup data snapshots deleted from EC2’s European data center. Amazon dashboard notices of the problem indicate most of the data was recoverable but it’s not clear whether that happened in every instance.

The European outage has lasted longer and been harder to fix, but a 40-minute outage in North America Monday night illustrates just how significant Amazon Web Services has become to the proper functioning of the Internet. At the time of the outage, TechCrunch’s MG Siegler wrote:

Are you trying to use the web right now? Just stop. It’s largely broken.
As indicated by about 20 tips in the last few minutes and pretty much all of Twitter, Amazon’s EC2 service appears to be down. That means services like Reddit, Heroku, Foursquare, Instagram, Fab, Quora, Netflix and many, many others are down.

There’s a high cost for outages, both in lost transactions and lost customer trust, but Amazon’s recent troubles illustrate how difficult it is to protect against all possible scenarios, according to Information Week:

Amazon needed disaster recovery capability with live data replication to be in place for many customers to avoid being caught in the outage. … To avoid being caught in the European outage, Amazon customers would have had to take extraordinary measures to protect themselves before the incident occurred, said Indu Kodukula, CTO of SunGard Availability Services, a disaster recovery specialist firm.

And lest anyone think that Amazon is the only company subject to outages, the content-delivery network Akamai, responsible for delivering many websites, also experienced an outage this week. Dan Rayburn of Seeking Alpha writes:

One thing the Akamai and Amazon outages should prove to everyone is that even though all the CDNs always talk about the redundancy built into their networks, ALL networks have outages at one time or another. There has never been a network that hasn’t had a major outage and there is no such thing as 100% up-time, no matter what any CDN claims or guarantees in an SLA.

This really underscores the inherent problems of maintaining web-scale systems. As I wrote in April, at the time of the last major Amazon outage:

“It’s not just individual systems that can fail,” says Neil Conway, a PhD student at the University of California, Berkeley, who works on a research project involving large-scale and complex computing platforms. “One failure event can have all of these cascading effects.” A similar problem led to a temporary failure of Amazon’s Simple Storage Service in 2008.

One of the biggest challenges, Conway says, is that “testing is almost impossible, because by definition these are unusual situations.” He adds that it’s difficult to simulate the behavior of a system as large and complex as Amazon Web Services, or even to know what to simulate.

Conway expects companies and researchers to look into new ways of testing abnormal situations for cloud computing systems. “The severity of the outage and the time it took [Amazon] to recover will draw a lot of people’s attention,” he says.
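
One common way to rehearse the "unusual situations" Conway describes is fault injection: deliberately making a dependency fail and watching how the code around it copes. The Python sketch below is only an illustration of that idea, not a description of how Amazon or any researcher quoted here tests systems. The names FlakyService, fetch_with_retry, and success_ratio are hypothetical, and the failure model (independent random errors) is a simplifying assumption that does not capture the cascading failures discussed above.

```python
import random


class FlakyService:
    """Hypothetical stand-in for a remote dependency that fails on demand."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a test run is repeatable
        self.calls = 0

    def fetch(self, key):
        self.calls += 1
        # Inject a failure with the configured probability (an assumption:
        # real outages are correlated, not independent coin flips).
        if self.rng.random() < self.failure_rate:
            raise ConnectionError(f"injected failure while fetching {key!r}")
        return f"value-for-{key}"


def fetch_with_retry(service, key, attempts=3):
    """Retry a failing call a bounded number of times, then give up."""
    last_error = None
    for _ in range(attempts):
        try:
            return service.fetch(key)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def success_ratio(failure_rate, requests=1000, attempts=3, seed=0):
    """Fraction of requests that succeed under a given injected failure rate."""
    service = FlakyService(failure_rate, seed)
    ok = 0
    for i in range(requests):
        try:
            fetch_with_retry(service, f"key-{i}", attempts)
            ok += 1
        except RuntimeError:
            pass
    return ok / requests


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.5, 1.0):
        print(f"failure rate {rate:.1f}: {success_ratio(rate):.3f} of requests succeeded")
```

Even this toy version shows the limits Conway points to: retries mask occasional errors, but when the dependency fails every time (a failure rate of 1.0) no amount of retrying helps, and a sketch like this says nothing about failures that spread from one system to another.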
