Faster Cloud Computing
Reliable software can now handle data on the fly rather than in batches
Source: “MapReduce Online”
Tyson Condie et al.
Proceedings of the Seventh USENIX Symposium on Networked Systems Design and Implementation (NSDI '10), April 28-30, 2010, San Jose, CA
Results: Researchers have modified Hadoop MapReduce, a software platform designed to reliably process large amounts of data on a cluster of computers (as is necessary in cloud computing). The changes reduced the time the software takes to process data by several orders of magnitude, without sacrificing the reliability the technology is known for.
Why it matters: The earlier version of Hadoop was too slow for applications that require real-time responsiveness, such as providing near-instant updates about traffic or sales transactions on a website. The new version could expand the range of applications that can run on distributed computers. It could also make applications run in the cloud more reliable by allowing managers to catch abnormal behavior as soon as it happens.
Methods: The researchers reduced the time it took Hadoop MapReduce to complete jobs by adapting a technique called pipelining. Ordinarily, Hadoop waits until one task is complete before starting the next; that makes it easier to handle the failure of a computer in the cluster. With pipelining, data is sent downstream and processed continuously, before the first task has finished.
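The pipelining idea can be sketched in miniature: instead of the mapper writing all of its output to disk before the reducer starts (the ordinary batch behavior), the mapper pushes each key-value pair onto a shared queue the moment it is produced, and the reducer consumes and aggregates concurrently. This is a minimal single-machine illustration, not Hadoop's actual implementation; the function and variable names are invented for the example.

```python
import threading
import queue

SENTINEL = object()  # marks the end of the map task's output

def mapper(records, out_q):
    """Emit (word, 1) pairs as each record is read, rather than
    buffering everything until the whole map task finishes."""
    for line in records:
        for word in line.split():
            out_q.put((word, 1))
    out_q.put(SENTINEL)

def reducer(in_q):
    """Consume pairs continuously, maintaining a running count that
    exists (and could be reported) before the mapper completes."""
    counts = {}
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        key, value = item
        counts[key] = counts.get(key, 0) + value
    return counts

def pipelined_word_count(records):
    """Run mapper and reducer concurrently, connected by a queue."""
    q = queue.Queue()
    result = {}
    consumer = threading.Thread(target=lambda: result.update(reducer(q)))
    consumer.start()
    mapper(records, q)  # producer runs while the consumer drains the queue
    consumer.join()
    return result
```

Because the reducer's partial counts accumulate while the mapper is still running, the system can surface early, approximate results on a live data stream; in the batch model those counts would not exist until the map phase had completed. (The paper's real contribution is making this overlap safe under machine failures, which this sketch does not attempt.)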
Next steps: One of the researchers would like to develop the system further so that it can be used to customize Web-page layouts in real time in response to user behavior.