Optimizing MapReduce Based on Locality of K-V Pairs and Overlap between Shuffle and Local Reduce

MapReduce is currently the most popular programming model for Big Data processing. In Hadoop, a typical open source implementation of MapReduce, execution is divided into map, shuffle, and reduce phases. In the map phase, following the principle of moving computation towards data, the load is basically balanced and network traffic is relatively small. However, shuffle is likely to cause a burst of network communication. At the same time, a reduce phase that does not account for data skew leads to an imbalanced load and, in turn, to performance degradation. This paper proposes a Locality-Enhanced Load Balance (LELB) algorithm, extends the execution flow of MapReduce to Map, Local reduce, Shuffle and final Reduce (MLSR), and proposes a corresponding MLSR algorithm. These algorithms move part of the reduce computation into a local reduce step and overlap it with shuffle in order to take full advantage of CPU and I/O resources. Experimental results show that execution with the LELB and MLSR algorithms outperforms execution with Hadoop by up to 9.2% (for Merge Sort) and 14.4% (for Word Count).
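The four-stage flow named above (map, local reduce, shuffle, final reduce) can be illustrated with a minimal single-process sketch for Word Count. This is not the paper's LELB or MLSR algorithm: it does not model locality-aware partitioning, load balancing, or the overlap between shuffle and local reduce. All function names and the byte-sum partitioner are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Emit one (word, 1) pair per word in an input split."""
    return [(word, 1) for word in split.split()]


def local_reduce(pairs):
    """Combine pairs with the same key on the node that produced them,
    so that fewer K-V pairs have to cross the network in shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return dict(local)


def partition_index(key, num_reducers):
    """Hypothetical deterministic partitioner (byte sum modulo reducers)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle(local_results, num_reducers):
    """Route each locally reduced pair to the reducer that owns its key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for local in local_results:
        for key, value in local.items():
            partitions[partition_index(key, num_reducers)][key].append(value)
    return partitions


def final_reduce(partition):
    """Sum the partial counts that arrived for each key."""
    return {key: sum(values) for key, values in partition.items()}


def mlsr_word_count(splits, num_reducers=2):
    """Run map, local reduce, shuffle, and final reduce in sequence."""
    local_results = [local_reduce(map_phase(s)) for s in splits]
    result = {}
    for partition in shuffle(local_results, num_reducers):
        result.update(final_reduce(partition))
    return result


if __name__ == "__main__":
    print(mlsr_word_count(["a b a", "b c"]))
```

In this sketch the first split emits three pairs from map but only two after local reduce, which is the kind of shuffle-traffic reduction the abstract refers to; a real implementation would run these stages concurrently across nodes rather than in sequence.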
