On the performance and energy efficiency of Hadoop deployment models

The exponential growth of scientific and business data has resulted in the evolution of the cloud computing and the MapReduce parallel programming model. Cloud computing emphasizes increased utilization and power savings through consolidation while MapReduce enables large scale data analysis. The Hadoop framework has recently evolved to the standard framework implementing the MapReduce model. In this paper, we evaluate Hadoop performance in both the traditional model of collocated data and compute services as well as consider the impact of separating out the services. The separation of data and compute services provides more flexibility in environments where data locality might not have a considerable impact such as virtualized environments and clusters with advanced networks. In this paper, we also conduct an energy efficiency evaluation of Hadoop on physical and virtual clusters in different configurations. Our extensive evaluation shows that: (1) performance on physical clusters is significantly better than on virtual clusters; (2) performance degradation due to separation of the services depends on the data to compute ratio; (3) application completion progress correlates with the power consumption and power consumption is heavily application specific.

[1]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[2]  Lavanya Ramakrishnan,et al.  Evaluating Hadoop for Data-Intensive Scientific Operations , 2012, 2012 IEEE Fifth International Conference on Cloud Computing.

[3]  Carlos Maltzahn,et al.  Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[4]  Jignesh M. Patel,et al.  Energy management for MapReduce clusters , 2010, Proc. VLDB Endow..

[5]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[6]  Scott Shenker,et al.  Disk-Locality in Datacenter Computing Considered Irrelevant , 2011, HotOS.

[7]  Geoffrey C. Fox,et al.  MapReduce for Data Intensive Scientific Analyses , 2008, 2008 IEEE Fourth International Conference on eScience.

[8]  Rong Ge,et al.  Improving MapReduce energy efficiency for computation intensive workloads , 2011, 2011 International Green Computing Conference and Workshops.

[9]  Christine Morin,et al.  Snooze: A Scalable and Autonomic Virtual Machine Management Framework for Private Clouds , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[10]  Yanpei Chen,et al.  Energy efficiency for large-scale MapReduce workloads with significant interactive analysis , 2012, EuroSys '12.

[11]  Tajana Simunic,et al.  vGreen: a system for energy efficient computing in virtualized environments , 2009, ISLPED.

[12]  Aravind Menon,et al.  Big data @ facebook , 2012 .

[13]  Jordi Torres,et al.  GreenHadoop: leveraging green energy in data-processing frameworks , 2012, EuroSys '12.

[14]  Alexandra Carpen-Amarie,et al.  MapReduce Applications in the Cloud: A Cost Evaluation of Computation and Storage , 2012, Globe.

[15]  Devarshi Ghoshal,et al.  I/O performance of virtualized cloud environments , 2011, DataCloud-SC '11.

[16]  Hai Jin,et al.  Evaluating MapReduce on Virtual Machines: The Hadoop Case , 2009, CloudCom.

[17]  Gernot Heiser,et al.  Dynamic voltage and frequency scaling: the laws of diminishing returns , 2010 .

[18]  Franck Cappello,et al.  Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed , 2006, Int. J. High Perform. Comput. Appl..

[19]  Christoforos E. Kozyrakis,et al.  On the energy (in)efficiency of Hadoop clusters , 2010, OPSR.