Load Balancing in MapReduce Based on Data Locality

With explosive growth in data size at era of information, MapReduce - a programing mode, which can process data in parallel, has been widely used. However, the original system gradually exposes some shortcomings. For example, handling skewed data can cause the imbalance of the system loads. After mapper processes data, the result will be sent to reducer by partition function. An inappropriate partition algorithm may result in poor network quality, the overloading of some reducers and the extension of the execution time of job. In summary, using an inappropriate algorithm to process skewed data will form a negative impact on the system performance. In order to solve load imbalance problem and improve performance of cluster, we plan to design an effective partition algorithm to guide the process of assigning data. Therefore, we develop an algorithm named CLP - Cluster Locality Partition, this algorithm consists of three parts: Preprocess part, Data-Cluster part and Locality-Partition part. The experimental results illustrate that the algorithm proposed in this paper is better than the default partition algorithm in the aspects of execution time and load balancing.

[1]  Magdalena Balazinska,et al.  ParaTimer: a progress indicator for MapReduce DAGs , 2010, SIGMOD Conference.

[2]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[3]  Hai Jin,et al.  LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud , 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[4]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[5]  T. N. Vijaykumar,et al.  Tarazu: optimizing MapReduce on heterogeneous clusters , 2012, ASPLOS XVII.

[6]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[7]  Mohammad Hammoud,et al.  Locality-Aware Reduce Task Scheduling for MapReduce , 2011, 2011 IEEE Third International Conference on Cloud Computing Technology and Science.

[8]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[9]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[10]  Amin Vahdat,et al.  Scale-Out Networking in the Data Center , 2010, IEEE Micro.

[11]  Prashant J. Shenoy,et al.  A platform for scalable one-pass analytics using MapReduce , 2011, SIGMOD '11.

[12]  Amin Vahdat,et al.  PortLand: a scalable fault-tolerant layer 2 data center network fabric , 2009, SIGCOMM '09.

[13]  Christos Faloutsos,et al.  Clustering very large multi-dimensional datasets with MapReduce , 2011, KDD.

[14]  Nikolaus Augsten,et al.  Load Balancing in MapReduce Based on Scalable Cardinality Estimates , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[15]  M. Balazinska,et al.  A Study of Skew in MapReduce Applications , 2011 .

[16]  Liang Chen,et al.  Handling data skew in parallel joins in shared-nothing systems , 2008, SIGMOD Conference.