Hello,

We noticed you're browsing in private or incognito mode.

To continue reading this article, please exit incognito mode or log in.

Not an Insider? Subscribe now for unlimited access to online articles.

Business Report

Service Blackouts Threaten Cloud Users

When Amazon’s computers went dark last April, companies realized that there are no guarantees in the cloud.

  • by Cindy Waxer
  • October 19, 2011
  • Damage control: Internet discussion about the service outage that struck Amazon Web Services in April spiked as soon as problems began (April 21st) and again when Amazon explained the cause (April 29th). The data is based on selected mentions on Twitter, blogs, and in online media.

For all the unprecedented scalability and convenience of cloud computing, there’s one way it falls short: reliability.

Just ask Jeff Malek, cofounder of BigDoor, a Seattle company whose game software is hosted on the public servers of Amazon. Last April, problems in a Northern Virginia data center crippled Amazon’s northeast operations, affecting many cloud-based businesses. Spotty service over four days left BigDoor scrambling to find technical solutions and issuing a steady stream of apologies to its 250 clients.

Since then, BigDoor has joined a growing number of companies that are seeking new ways of building outage-resistant systems in the cloud, often at additional expense and inconvenience.

Major cloud providers such as Salesforce.com, Microsoft, and Google have all experienced outages. During a 30-day period in August and September, for instance, Google’s cloud-based apps experienced six service disruptions, according to the company’s Apps Status Dashboard, including one hour-long outage on September 7 that shut out several million users of Google Docs. Some outages have been caused by unpredictable events like lightning strikes, but software updates are frequently the culprit. Google said changes to its software caused Google Docs to go down, and an attempt to run updated code was also blamed for a massive e-mail outage affecting users of BlackBerry phones in Europe this month. (RIM, like Google, operates its own servers instead of relying on public cloud providers.)

Even though outages put businesses at immense risk, public cloud providers still don’t offer ironclad guarantees. In its so-called “service-level agreement,” Amazon says that if its services are unavailable for more than 0.05 percent of a year (around four hours) it will give the clients a credit “equal to 10% of their bill.” Some in the industry believe public clouds like Amazon should aim for 99.999 percent availability, or downtime of only around five minutes a year.

Until then, businesses must cope with the reality of imperfect service. “For me, [service agreements] are created by bureaucrats and lawyers,” says Malek. “What I care about is how dependable the cloud service is, and what a provider has done to prepare for outages.”

Amazon has computing data centers in five regions around the world. Companies can opt to run applications in multiple regions as a precautionary measure—the same way you’d back up your computer to an external hard drive. Immediately after the big outage in April, BigDoor upgraded to host its services in multiple Amazon regions. However, adding that layer of protection led to a 5 percent increase in the company’s monthly cloud bills. And getting applications to communicate seamlessly across isolated regions requires “extra resources, money, and time,” says Malek.

Netflix, which uses Amazon’s cloud to stream videos, is another company that is putting engineering effort into protecting against cloud failures. Following the problems at Amazon, the company developed Chaos Gorilla, internal software that allows its engineers simulate how effectively Netflix’s systems can reroute data when a network zone goes down.

One company that says it wasn’t affected by recent cloud problems is SimpleGeo, a San Francisco startup that uses Amazon to serve location-awareness tools to its clients. SimpleGeo emerged unscathed from the Amazon outage thanks to its use of a technique called back-pressure routing, which avoids network areas suffering slow response times. Cofounder and chief technology officer Joe Stump compares the technology to a driver’s adjusting speed or switching lanes on a clogged highway. Although SimpleGeo may have dodged network outages, doing so came at a substantial cost. “We have three engineers dedicated full time to building internal tools to manage our infrastructure,” he says.

Indeed, Stump says only one thing is 100 percent certain when it comes to the cloud: “You always have to architect your systems under an assumption of failure.”

Become an MIT Technology Review Insider for in-depth analysis and unparalleled perspective.

Subscribe today

Uh oh–you've read all of your free articles for this month.

Insider Premium
$179.95/yr US PRICE

More from Business Impact
Business in the Cloud

How technology advances are changing the economy and providing new opportunities in many industries.

Want more award-winning journalism? Subscribe to Insider Plus.
  • Insider Plus {! insider.prices.plus !}*

    {! insider.display.menuOptionsLabel !}

    Everything included in Insider Basic, plus the digital magazine, extensive archive, ad-free web experience, and discounts to partner offerings and MIT Technology Review events.

    See details+

    What's Included

    Unlimited 24/7 access to MIT Technology Review’s website

    The Download: our daily newsletter of what's important in technology and innovation

    Bimonthly print magazine (6 issues per year)

    Bimonthly digital/PDF edition

    Access to the magazine PDF archive—thousands of articles going back to 1899 at your fingertips

    Special interest publications

    Discount to MIT Technology Review events

    Special discounts to select partner offerings

    Ad-free web experience

/
You've read all of your free articles this month. This is your last free article this month. You've read of free articles this month. or  for unlimited online access.