Evaluating MapReduce on Virtual Machines: The Hadoop Case

MapReduce is emerging as an important programming model for large-scale parallel applications. Meanwhile, Hadoop, an open source implementation of MapReduce, enjoys wide popularity for developing data-intensive applications in the cloud. Because the computing unit in the cloud is virtual machine (VM) based, it is feasible to demonstrate the applicability of MapReduce in a virtualized data center. Although the potential for poor performance and heavy load no doubt exists, virtual machines can be used to fully utilize system resources, ease the management of such systems, improve reliability, and save power. In this paper, a series of experiments is conducted to measure and analyze the performance of Hadoop on VMs. The experiments serve as a basis for outlining several issues that need to be considered when implementing MapReduce to fit completely in the cloud.
