A View from Erica Naone

Were Amazon's Outages Inevitable?

It doesn’t seem possible to keep services up and running in the face of every possible problem.

August 10, 2011

Amazon is still working to recover from outages this week, particularly problems resulting from a Sunday lightning strike in Dublin that knocked service offline for many European customers. Information Week reports that the company is struggling to restore data to some affected customers:

An error embedded in a piece of Amazon Web Services cleanup software has resulted in some customers having their backup data snapshots deleted from EC2’s European data center. Amazon dashboard notices of the problem indicate most of the data was recoverable but it’s not clear whether that happened in every instance.

The European outage has lasted longer and been harder to fix, but a 40-minute outage in North America Monday night illustrates just how significant Amazon Web Services has become to the proper functioning of the Internet. At the time of the outage, TechCrunch’s MG Siegler wrote:

Are you trying to use the web right now? Just stop. It’s largely broken.
As indicated by about 20 tips in the last few minutes and pretty much all of Twitter, Amazon’s EC2 service appears to be down. That means services like Reddit, Heroku, Foursquare, Instagram, Fab, Quora, Turntable.fm, Netflix and many, many others are down.

There’s a high cost for outages, both in lost transactions and lost customer trust, but Amazon’s recent troubles illustrate how difficult it is to protect against all possible scenarios, according to Information Week:

Amazon needed disaster recovery capability with live data replication to be in place for many customers to avoid being caught in the outage. … To avoid being caught in the European outage, Amazon customers would have had to take extraordinary measures to protect themselves before the incident occurred, said Indu Kodukula, CTO of SunGard Availability Services, a disaster recovery specialist firm.

And lest anyone think that Amazon is the only company subject to outages, the content-delivery network Akamai, responsible for delivering websites such as Apple.com, also experienced an outage this week. Dan Rayburn of Seeking Alpha writes:

One thing the Akamai and Amazon outages should prove to everyone is that even though all the CDNs always talk about the redundancy built into their networks, ALL networks have outages at one time or another. There has never been a network that hasn’t had a major outage and there is no such thing as 100% up-time, no matter what any CDN claims or guarantees in an SLA.

These outages underscore the inherent difficulty of maintaining web-scale systems. As I wrote in April, at the time of the last major Amazon outage:

“It’s not just individual systems that can fail,” says Neil Conway, a PhD student at the University of California, Berkeley, who works on a research project involving large-scale and complex computing platforms. “One failure event can have all of these cascading effects.” A similar problem led to a temporary failure of Amazon’s Simple Storage Service in 2008.

One of the biggest challenges, Conway says, is that “testing is almost impossible, because by definition these are unusual situations.” He adds that it’s difficult to simulate the behavior of a system as large and complex as Amazon Web Services, or even to know what to simulate.

Conway expects companies and researchers to look into new ways of testing abnormal situations for cloud computing systems. “The severity of the outage and the time it took [Amazon] to recover will draw a lot of people’s attention,” he says.
