SLDP: A Novel Data Placement Strategy for Large-Scale Heterogeneous Hadoop Cluster

Hadoop, a popular open-source implementation of MapReduce, is widely used for large-scale data-intensive applications such as data mining, web indexing, and scientific computing. The current Hadoop implementation assumes that the nodes in a cluster are homogeneous, and the Hadoop Distributed File System (HDFS) distributes data across nodes based on disk space availability. This data placement strategy is very efficient in homogeneous environments, where nodes are identical in both computing power and disk capacity. In practice, however, the homogeneity assumption does not always hold. In heterogeneous environments, Hadoop's scheduler, working on top of the default HDFS data placement strategy, leads to severe performance degradation and energy dissipation. In this paper, we propose a novel snakelike data placement mechanism (SLDP) for large-scale heterogeneous Hadoop clusters. SLDP first applies a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST in a circuitous (snakelike) order according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to reduce disk space consumption and provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient and space-saving, and that it significantly improves MapReduce performance in a heterogeneous Hadoop cluster.
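To make the snakelike idea concrete, the following Python sketch shows one possible reading of it for a single VST. It is an illustration under stated assumptions, not the algorithm evaluated in the paper: the abstract does not specify how nodes are ordered within a tier, how hotness is measured, or how replica counts are derived. Here we assume that the nodes of a tier are listed in a fixed order, that each block has a numeric hotness score, and that the replica count scales linearly with hotness between a minimum and a maximum. The names `snakelike_order`, `place_blocks`, and `replica_count` are hypothetical.

```python
from typing import Dict, List


def snakelike_order(num_nodes: int, num_blocks: int) -> List[int]:
    """Return a node index for each block, sweeping forward, then backward."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    order = []
    for i in range(num_blocks):
        sweep, pos = divmod(i, num_nodes)
        # Even sweeps run first-to-last, odd sweeps run last-to-first.
        order.append(pos if sweep % 2 == 0 else num_nodes - 1 - pos)
    return order


def place_blocks(nodes: List[str], hotness: Dict[str, float]) -> Dict[str, List[str]]:
    """Assign the blocks of one tier, hottest first, in snakelike order.

    Assumption: `nodes` is the (hypothetical) ordered node list of one VST.
    """
    # Sort by descending hotness; break ties by block name for determinism.
    blocks = sorted(hotness, key=lambda b: (-hotness[b], b))
    placement: Dict[str, List[str]] = {node: [] for node in nodes}
    for block, idx in zip(blocks, snakelike_order(len(nodes), len(blocks))):
        placement[nodes[idx]].append(block)
    return placement


def replica_count(block_hotness: float, max_hotness: float,
                  min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Assumed linear hotness-proportional replica count (illustrative only)."""
    if max_hotness <= 0:
        return min_replicas
    ratio = min(max(block_hotness / max_hotness, 0.0), 1.0)
    return min_replicas + int((max_replicas - min_replicas) * ratio)


if __name__ == "__main__":
    tier = ["n0", "n1", "n2"]
    scores = {"a": 9, "b": 7, "c": 5, "d": 3, "e": 1}
    print(place_blocks(tier, scores))
    print({b: replica_count(h, 9) for b, h in scores.items()})
```

Compared with plain round-robin, the back-and-forth sweep avoids always handing the hottest block of each pass to the same node, which is one plausible motivation for a snakelike order.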
