Were Amazon’s Outages Inevitable?

It doesn’t seem possible to keep services up and running in the face of every possible problem.

Erica Naone

August 10, 2011

Amazon is still working to recover from outages this week, particularly problems resulting from a Sunday lightning strike in Dublin that knocked service offline for many European customers. InformationWeek reports that the company is struggling to restore data to some affected customers:

An error embedded in a piece of Amazon Web Services cleanup software has resulted in some customers having their backup data snapshots deleted from EC2’s European data center. Amazon dashboard notices of the problem indicate most of the data was recoverable but it’s not clear whether that happened in every instance.

The European outage has lasted longer and been harder to fix, but a 40-minute outage in North America Monday night illustrates just how significant Amazon Web Services has become to the proper functioning of the Internet. At the time of the outage, TechCrunch’s MG Siegler wrote:

Are you trying to use the web right now? Just stop. It’s largely broken.
As indicated by about 20 tips in the last few minutes and pretty much all of Twitter, Amazon’s EC2 service appears to be down. That means services like Reddit, Heroku, Foursquare, Instagram, Fab, Quora, Turntable.fm, Netflix and many, many others are down.

There’s a high cost for outages, both in lost transactions and lost customer trust, but Amazon’s recent troubles illustrate how difficult it is to protect against all possible scenarios, according to InformationWeek:

Amazon needed disaster recovery capability with live data replication to be in place for many customers to avoid being caught in the outage. … To avoid being caught in the European outage, Amazon customers would have had to take extraordinary measures to protect themselves before the incident occurred, said Indu Kodukula, CTO of SunGard Availability Services, a disaster recovery specialist firm.

And lest anyone think that Amazon is the only company subject to outages, the content-delivery network Akamai, responsible for delivering websites such as Apple.com, also experienced an outage this week. Dan Rayburn of Seeking Alpha writes:

One thing the Akamai and Amazon outages should prove to everyone is that even though all the CDNs always talk about the redundancy built into their networks, ALL networks have outages at one time or another. There has never been a network that hasn’t had a major outage and there is no such thing as 100% up-time, no matter what any CDN claims or guarantees in an SLA.

This really underscores the inherent problems of maintaining web-scale systems. As I wrote in April, at the time of the last major Amazon outage:

“It’s not just individual systems that can fail,” says Neil Conway, a PhD student at the University of California, Berkeley, who works on a research project involving large-scale and complex computing platforms. “One failure event can have all of these cascading effects.” A similar problem led to a temporary failure of Amazon’s Simple Storage Service in 2008.

One of the biggest challenges, Conway says, is that “testing is almost impossible, because by definition these are unusual situations.” He adds that it’s difficult to simulate the behavior of a system as large and complex as Amazon Web Services, or even to know what to simulate.

Conway expects companies and researchers to look into new ways of testing abnormal situations for cloud computing systems. “The severity of the outage and the time it took [Amazon] to recover will draw a lot of people’s attention,” he says.
