Were Amazon’s Outages Inevitable?

It doesn’t seem possible to keep services up and running in the face of every possible problem.
August 10, 2011

Amazon is still working to recover from outages this week, particularly problems resulting from a Sunday lightning strike in Dublin that knocked service offline for many European customers. Information Week reports that the company is struggling to restore data to some affected customers:

An error embedded in a piece of Amazon Web Services cleanup software has resulted in some customers having their backup data snapshots deleted from EC2’s European data center. Amazon dashboard notices of the problem indicate most of the data was recoverable but it’s not clear whether that happened in every instance.

The European outage has been longer in duration and harder to fix, but a 40-minute outage in North America Monday night illustrates just how significant Amazon Web Services has become to the proper functioning of the Internet. At the time of the outage, Tech Crunch’s MG Siegler wrote:

Are you trying to use the web right now? Just stop. It’s largely broken.
As indicated by about 20 tips in the last few minutes and pretty much all of Twitter, Amazon’s EC2 service appears to be down. That means services like Reddit, Heroku, Foursquare, Instagram, Fab, Quora, Turntable.fm, Netflix and many, many others are down.

There’s a high cost for outages, both in lost transactions and lost customer trust, but Amazon’s recent troubles illustrate how difficult it is to protect against all possible scenarios, according to Information Week:

Amazon needed disaster recovery capability with live data replication to be in place for many customers to avoid being caught in the outage. … To avoid being caught in the European outage, Amazon customers would have had to take extraordinary measures to protect themselves before the incident occurred, said Indu Kodukula, CTO of SunGard Availability Services, a disaster recovery specialist firm.

And lest anyone think that Amazon is the only company subject to outages, the content-delivery network Akamai, responsible for delivering websites such as Apple.com, also experienced an outage this week. Dan Rayburn of Seeking Alpha writes:

One thing the Akamai and Amazon outages should prove to everyone is that even though all the CDNs always talk about the redundancy built into their networks, ALL networks have outages at one time or another. There has never been a network that hasn’t had a major outage and there is no such thing as 100% up-time, no matter what any CDN claims or guarantees in an SLA.

These outages underscore the inherent difficulty of maintaining web-scale systems. As I wrote in April, at the time of the last major Amazon outage:

“It’s not just individual systems that can fail,” says Neil Conway, a PhD student at the University of California, Berkeley, who works on a research project involving large-scale and complex computing platforms. “One failure event can have all of these cascading effects.” A similar problem led to a temporary failure of Amazon’s Simple Storage Service in 2008.

One of the biggest challenges, Conway says, is that “testing is almost impossible, because by definition these are unusual situations.” He adds that it’s difficult to simulate the behavior of a system as large and complex as Amazon Web Services, or even to know what to simulate.

Conway expects companies and researchers to look into new ways of testing abnormal situations for cloud computing systems. “The severity of the outage and the time it took [Amazon] to recover will draw a lot of people’s attention,” he says.
