Failure Cascading Through the Cloud

Two major outages illustrate how complicated it is to keep a cloud system up and running.

Recently two major cloud computing services, Amazon’s Elastic Compute Cloud and Sony’s PlayStation Network, have suffered extended outages. Though the circumstances of each were different, details that the companies have released about their causes show how delicate complex cloud systems can be.

Cloud computing services have grown in popularity over the past few years; they’re flexible, and often less expensive than owning physical systems and software. Amazon’s service attracts business customers who want the power of a modern, distributed system without having to build and maintain the infrastructure themselves. The PlayStation Network offers an enhanced experience for gamers, such as multi-player gameplay or an easy way to find and download new titles. But the outages illustrate how customers are at the mercy of the cloud provider, both for fixing the problem and for finding out what went wrong.

The Elastic Compute Cloud—one of Amazon’s most popular Web services—was down from Thursday, April 21, to Sunday, April 24. Popular among startups, the service is used by Foursquare, Quora, Reddit, and others. Users can rent virtual computing resources and scale up or down as their needs fluctuate.

Amazon’s outage originated in a feature called Elastic Block Store, which provides a way to store data so that it works optimally with the Elastic Compute Cloud’s virtual machines. Elastic Block Store is designed to protect data from being lost by automatically creating replicas of storage units, or “nodes,” within Amazon’s network.

The problem occurred when Amazon engineers attempting to upgrade the primary Elastic Block Store network accidentally routed some traffic onto a backup network that didn’t have enough capacity. Though this individual mistake was small, it had far-reaching effects that were amplified by the systems put in place to protect data.

A large number of Elastic Block Store nodes lost their connection to the replicas they had created, causing them to immediately look for somewhere to create a new replica. The result was what Amazon calls “a re-mirroring storm” as the nodes all tried to create new replicas at once. The outage worsened as other nodes began to fail under the traffic onslaught, leaving even more orphaned nodes hunting for storage space in which to create replicas.
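
The feedback loop described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of how Elastic Block Store actually works: the function name, its parameters, and the rule that a node fails once it receives more search requests than it can absorb are all hypothetical, chosen to show how a shortage of spare capacity can turn a small disruption into a cascade.

```python
import math


def simulate_remirror_storm(total_nodes, initially_orphaned, spare_slots,
                            requests_per_node, max_rounds=50):
    """Toy model of a re-mirroring storm (hypothetical, for illustration only).

    total_nodes        -- nodes in the cluster
    initially_orphaned -- nodes that lost contact with their replica
    spare_slots        -- free places where a new replica can be created
    requests_per_node  -- search requests a healthy node can absorb per
                          round before it fails (must be greater than 0)
    """
    healthy = total_nodes - initially_orphaned
    orphans = initially_orphaned
    failed = 0
    rounds = 0

    while orphans > 0 and rounds < max_rounds:
        rounds += 1

        # Orphans that find spare capacity re-mirror and become healthy again.
        placed = min(orphans, spare_slots)
        spare_slots -= placed
        orphans -= placed
        healthy += placed
        if orphans == 0:
            break

        # The remaining orphans keep searching, which loads the healthy nodes.
        capacity = healthy * requests_per_node
        excess = max(0, orphans - capacity)
        newly_failed = min(healthy, math.ceil(excess / requests_per_node))

        # Failed nodes are not reused, and each one orphans a healthy partner.
        healthy -= newly_failed
        failed += newly_failed
        newly_orphaned = min(healthy, newly_failed)
        healthy -= newly_orphaned
        orphans += newly_orphaned

        # Nothing changed this round: the system is stuck, so stop.
        if placed == 0 and newly_failed == 0:
            break

    return {"healthy": healthy, "orphans": orphans,
            "failed": failed, "rounds": rounds}


if __name__ == "__main__":
    # Plenty of spare capacity: the disruption is absorbed.
    print(simulate_remirror_storm(100, 10, 20, 2))
    # Too little spare capacity: failures spread through the cluster.
    print(simulate_remirror_storm(100, 60, 5, 1))
```

In this model the same cluster either recovers in a single round or loses every healthy node, depending only on how much spare capacity is available when the disruption begins.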

Amazon’s attempts to fix the problem were stymied by the need to avoid interference with other systems. For example, Elastic Block Store doesn’t reuse failed nodes, since the engineers who built it assumed they would contain data that might need to be recovered.

Amazon says the problem has led to better understanding of its network. “We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures,” the team responsible for fixing the network wrote in a statement.

However, some experts question whether this will really help prevent future outages. “It’s not just individual systems that can fail,” says Neil Conway, a PhD student at the University of California, Berkeley, who works on a research project involving large-scale and complex computing platforms. “One failure event can have all of these cascading effects.” A similar problem led to a temporary failure of Amazon’s Simple Storage Service in 2008.

One of the biggest challenges, Conway says, is that “testing is almost impossible, because by definition these are unusual situations.” He adds that it’s difficult to simulate the behavior of a system as large and complex as Amazon Web Services, or even to know what to simulate.

Conway expects companies and researchers to look into new ways of testing abnormal situations for cloud computing systems. “The severity of the outage and the time it took [Amazon] to recover will draw a lot of people’s attention,” he says.

Sony’s PlayStation Network, an online gaming platform linked to the PlayStation 3, has yet to be fully restored after its outage on April 20. The company took it down in response to a security breach and has been frantically reworking the system to keep it better protected in the future. In a press release, Sony offered some details of its progress to date. The company has added enhanced levels of data protection and encryption, additional firewalls, and better methods for detecting intrusions and unusual activity.

For both Sony and Amazon, these struggles are happening in public, under pressure, and under the scrutiny of millions. Systems as complex as cloud services are going to fail, and it’s impossible to anticipate all the conditions that could lead to trouble. But as cloud computing matures, companies will build more extensive testing, monitoring, and backup systems to prevent outages resulting in public embarrassment and financial loss.
