Business Report

Service Blackouts Threaten Cloud Users

When Amazon’s computers went dark last April, companies realized that there are no guarantees in the cloud.

  • by Cindy Waxer
  • October 19, 2011
  • Damage control: Internet discussion about the service outage that struck Amazon Web Services in April spiked as soon as problems began (April 21st) and again when Amazon explained the cause (April 29th). The data is based on selected mentions on Twitter, blogs, and in online media.

For all the unprecedented scalability and convenience of cloud computing, there’s one way it falls short: reliability.

Just ask Jeff Malek, cofounder of BigDoor, a Seattle company whose game software is hosted on the public servers of Amazon. Last April, problems in a Northern Virginia data center crippled Amazon’s northeast operations, affecting many cloud-based businesses. Spotty service over four days left BigDoor scrambling to find technical solutions and issuing a steady stream of apologies to its 250 clients.

Since then, BigDoor has joined a growing number of companies that are seeking new ways of building outage-resistant systems in the cloud, often at additional expense and inconvenience.

Major cloud providers such as, Microsoft, and Google have all experienced outages. During a 30-day period in August and September, for instance, Google’s cloud-based apps experienced six service disruptions, according to the company’s Apps Status Dashboard, including one hour-long outage on September 7 that shut out several million users of Google Docs. Some outages have been caused by unpredictable events like lightning strikes, but software updates are frequently the culprit. Google said changes to its software caused Google Docs to go down, and an attempt to run updated code was also blamed for a massive e-mail outage affecting users of BlackBerry phones in Europe this month. (RIM, like Google, operates its own servers instead of relying on public cloud providers.)

Even though outages put businesses at immense risk, public cloud providers still don’t offer ironclad guarantees. In its so-called “service-level agreement,” Amazon says that if its services are unavailable for more than 0.05 percent of a year (around four hours) it will give the clients a credit “equal to 10% of their bill.” Some in the industry believe public clouds like Amazon should aim for 99.999 percent availability, or downtime of only around five minutes a year.

Until then, businesses must cope with the reality of imperfect service. “For me, [service agreements] are created by bureaucrats and lawyers,” says Malek. “What I care about is how dependable the cloud service is, and what a provider has done to prepare for outages.”

Amazon has computing data centers in five regions around the world. Companies can opt to run applications in multiple regions as a precautionary measure—the same way you’d back up your computer to an external hard drive. Immediately after the big outage in April, BigDoor upgraded to host its services in multiple Amazon regions. However, adding that layer of protection led to a 5 percent increase in the company’s monthly cloud bills. And getting applications to communicate seamlessly across isolated regions requires “extra resources, money, and time,” says Malek.

Netflix, which uses Amazon’s cloud to stream videos, is another company that is putting engineering effort into protecting against cloud failures. Following the problems at Amazon, the company developed Chaos Gorilla, internal software that allows its engineers simulate how effectively Netflix’s systems can reroute data when a network zone goes down.

One company that says it wasn’t affected by recent cloud problems is SimpleGeo, a San Francisco startup that uses Amazon to serve location-awareness tools to its clients. SimpleGeo emerged unscathed from the Amazon outage thanks to its use of a technique called back-pressure routing, which avoids network areas suffering slow response times. Cofounder and chief technology officer Joe Stump compares the technology to a driver’s adjusting speed or switching lanes on a clogged highway. Although SimpleGeo may have dodged network outages, doing so came at a substantial cost. “We have three engineers dedicated full time to building internal tools to manage our infrastructure,” he says.

Indeed, Stump says only one thing is 100 percent certain when it comes to the cloud: “You always have to architect your systems under an assumption of failure.”

Get stories like this before anyone else with First Look.

Subscribe today
Already a Premium subscriber? Log in.

Uh oh–you've read all of your free articles for this month.

Insider Premium
$179.95/yr US PRICE

More from Business Impact
Business in the Cloud

How technology advances are changing the economy and providing new opportunities in many industries.

Want more award-winning journalism? Subscribe to Insider Basic.
  • Insider Basic {! insider.prices.basic !}*

    {! insider.display.menuOptionsLabel !}

    Six issues of our award winning magazine and daily delivery of The Download, our newsletter of what’s important in technology and innovation.

    See details+

    What's Included

    Bimonthly home delivery and unlimited 24/7 access to MIT Technology Review’s website.

    The Download. Our daily newsletter of what's important in technology and innovation.

You've read all of your free articles this month. This is your last free article this month. You've read of free articles this month. or  for unlimited online access.