Technology Review

Computing

Failure Cascading Through the Cloud

Two major outages illustrate how complicated it is to keep a cloud system up and running.

  • Tuesday, May 3, 2011
  • By Erica Naone

Two major cloud computing services, Amazon's Elastic Compute Cloud and Sony's PlayStation Network, recently suffered extended outages. Though the circumstances differed, the details the companies have released about the causes show how delicate complex cloud systems can be.

Cloud computing services have grown in popularity over the past few years; they're flexible, and often less expensive than owning physical systems and software. Amazon's service attracts business customers who want the power of a modern, distributed system without having to build and maintain the infrastructure themselves. The PlayStation Network offers an enhanced experience for gamers, such as multiplayer gameplay or an easy way to find and download new titles. But the outages illustrate how customers are at the mercy of the cloud provider, both for fixing the problem and for finding out what went wrong.

The Elastic Compute Cloud—one of Amazon's most popular Web services—was down from Thursday, April 21, to Sunday, April 24. Popular among startups, the service is used by Foursquare, Quora, Reddit, and others. Users can rent virtual computing resources and scale up or down as their needs fluctuate.

Amazon's outage was caused by a feature called Elastic Block Store, which provides a way to store data so that it works optimally with the Elastic Compute Cloud's virtual machines. Elastic Block Store is designed to protect data from being lost by automatically creating replicas of memory units, or "nodes," within Amazon's network.


The problem occurred when Amazon engineers attempting to upgrade the primary Elastic Block Store network accidentally routed some traffic onto a backup network that didn't have enough capacity. Though this individual mistake was small, it had far-reaching effects that were amplified by the systems put in place to protect data.

A large number of Elastic Block Store nodes lost their connection to the replicas they had created, causing them to immediately look for somewhere to create a new replica. The result was what Amazon calls "a re-mirroring storm" as the nodes tried to create new replicas all at once. The outage worsened as other nodes began to fail under the traffic onslaught, creating even more orphaned nodes hunting for storage space in which to create replicas.
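The feedback loop described above can be illustrated with a toy model. The sketch below is not Amazon's algorithm; every name and number in it (simulate_storm, replicas_per_node, requests_tolerated, the cluster sizes) is a hypothetical chosen only for illustration. It assumes that orphans claim whatever free space exists, that each healthy node tolerates a fixed number of unanswered search requests per round before failing, and that each failed node orphans the replicas it was holding and is never reused.

```python
def simulate_storm(healthy_nodes, free_slots, orphans,
                   replicas_per_node, requests_tolerated, max_rounds=10):
    """Toy model of a re-mirroring storm (illustrative assumptions only).

    Returns one (placed, failed, orphans_remaining) tuple per round.
    """
    history = []
    for _ in range(max_rounds):
        if orphans == 0 or healthy_nodes == 0:
            break
        # Orphans that find free space are re-mirrored and stop searching.
        placed = min(orphans, free_slots)
        free_slots -= placed
        orphans -= placed
        # Orphans still searching keep sending requests. Assumption: every
        # `requests_tolerated` stuck orphans overwhelm one healthy node.
        failed = min(healthy_nodes, orphans // requests_tolerated)
        healthy_nodes -= failed
        # Each failed node orphans the replicas it held; it is not reused.
        orphans += failed * replicas_per_node
        history.append((placed, failed, orphans))
    return history


# Enough spare capacity: every orphan is placed in the first round.
print(simulate_storm(100, 500, 300, 50, 20))

# Too little spare capacity: failures feed on themselves round after round.
for placed, failed, remaining in simulate_storm(100, 100, 300, 50, 20):
    print(f"placed={placed} failed={failed} orphans={remaining}")
```

In this model the only difference between recovery and collapse is the amount of free space available when the first replicas are lost, which is consistent with Amazon's stated focus on spare capacity for large recovery events.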

Amazon's attempts to fix the problem were stymied by the need to avoid interference with other systems. For example, Elastic Block Store doesn't reuse failed nodes, since the engineers who built it assumed they would contain data that might need to be recovered.

Amazon says the problem has led to a better understanding of its network. "We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures," the team responsible for fixing the network wrote in a statement.
