The Hidden Risk of a Meltdown in the Cloud

The cloud could suffer the same kind of collapse that plagues the financial system, according to an analysis of the unrecognised risks of cloud computing.

Emerging Technology from the arXiv

March 13, 2012

The cloud is essentially a metaphor for a network of computers in which computational tasks and resources can be shared.

The big idea here is that users simply rent the computing power, the storage or an application for as long as they need it without having to invest in the infrastructure behind it.

That makes computing cheaper, easier and more efficient.

There are well-known problems, of course. The most obvious relates to guaranteeing the security of data when it is stored on computers that a user does not own and that many others can also access. But various solutions have emerged, such as encrypting data before it is sent to the cloud. For that reason, the migration to the cloud is proceeding at full speed in many places.

That may be folly. Today, Bryan Ford at Yale University in New Haven says that the full risks of this migration have yet to be properly explored. He points out that complex systems can fail in many unexpected ways and outlines various simple scenarios in which a cloud could come unstuck.

In the worst-case scenario, a cloud could experience a full meltdown that could seriously threaten any business that relies on it.

Ford identifies a number of different possibilities. One example involves an application provider that bases its services in the cloud, such as a cloud-based advertising service.

He imagines a simple scenario in which the cloud operator distributes the service between two virtual servers, using a power balancing program to switch the load from one server to the other as conditions demand.

However, the application provider may also have a load balancing program that distributes the customer load.

Now Ford imagines the scenario in which both load balancing programs operate with the same refresh period, say once a minute. When these periods coincide, the control loops start sending the load back and forth between the virtual servers in a positive feedback loop.

“The two controllers each compensate with a stronger action causing a larger swing the next minute,” says Ford. Clearly, this is a process that must eventually spiral out of control and crash the system.
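The effect can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Ford's own model: the function name `simulate`, the proportional `gain` of 1.2 and the single "imbalance" number are all hypothetical choices made for illustration. Each controller on its own would settle the load (it overshoots slightly, then damps out), but when both read the same measurement and act in the same instant, their corrections add up and the swing grows every minute.

```python
def simulate(gain, steps, synchronized, initial_imbalance=1.0):
    """Return the load imbalance (server A minus server B) after each minute.

    Hypothetical toy model: each controller shifts `gain` times the
    imbalance it observes from the busier server to the quieter one.
    """
    imbalance = initial_imbalance
    history = [imbalance]
    for _ in range(steps):
        if synchronized:
            # Both controllers read the same measurement and act at once,
            # so their corrections are added together.
            reading = imbalance
            imbalance -= gain * reading  # cloud operator's balancer
            imbalance -= gain * reading  # application provider's balancer
        else:
            # Staggered: the second controller sees the first one's result.
            imbalance -= gain * imbalance  # cloud operator's balancer
            imbalance -= gain * imbalance  # application provider's balancer
        history.append(imbalance)
    return history


if __name__ == "__main__":
    for label, sync in (("synchronized", True), ("staggered", False)):
        run = simulate(gain=1.2, steps=10, synchronized=sync)
        print(label, [round(x, 4) for x in run])
```

In the synchronized run the imbalance flips sign each minute and grows in size, while in the staggered run it dies away almost immediately. The particular numbers depend entirely on the assumed gain; the point is only that two individually stable control loops can become unstable once their cycles coincide.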

Ford is careful to put the risk in context: “This simplistic example might be unlikely to occur in exactly this form on real systems, or might be quickly detected and ‘fixed’ during development and testing, but it suggests a general risk,” he says.

In fact, this kind of coupling between an application provider and the infrastructure provider is inevitable, particularly when the cloud provider’s system is opaque, so that external users cannot see the internal control loop cycles and so cannot avoid them.

“Non-transparent layering structures…may create unexpected and potentially catastrophic failure correlations, reminiscent of financial industry crashes,” he says.

But the lack of transparency is only part of the story. A more general risk arises when systems are complex because seemingly unrelated parts can become coupled in unexpected ways.

A growing number of complexity theorists are beginning to recognise this problem. The emerging consensus is that bizarre and unpredictable behaviour often arises in systems made up of “networks of networks”.

An obvious example is the flash crashes that now plague many financial markets, in which prices plummet dramatically for no apparent reason. Understanding how and why this happens is the focus of much research.

Given that the cloud is clearly becoming a network of networks that is rapidly growing in complexity, it’s not hard to imagine that the computing equivalent of flash crashes is not just likely but inevitable.

Of course, it would be easy for cloud providers to say that their systems are carefully designed and monitored and entirely risk-free in this respect. That would be an understandable knee-jerk reaction from a PR department.

But it ought to be a worrying sign for any customer, indicating that the providers simply do not understand the problem, let alone have a solution for it.

Ford concludes with the following: “We should study [these unrecognised risks] before our socioeconomic fabric becomes inextricably dependent on a convenient but potentially unstable computing model.”

Clearly, an eminently sensible suggestion.

Ref: arxiv.org/abs/1203.1979: Icebergs in the Clouds: the Other Risks of Cloud Computing
