Data Partitioning for Minimizing Transferred Data in MapReduce

Reducing data transfer in MapReduce’s shuffle phase is very important because it increases data locality of reduce tasks, and thus decreases the overhead of job executions. In the literature, several optimizations have been proposed to reduce data transfer between mappers and reducers. Nevertheless, all these approaches are limited by how intermediate key-value pairs are distributed over map outputs. In this paper, we address the problem of high data transfers in MapReduce, and propose a technique that repartitions tuples of the input datasets, and thereby optimizes the distribution of key-values over mappers, and increases the data locality in reduce tasks. Our approach captures the relationships between input tuples and intermediate keys by monitoring the execution of a set of MapReduce jobs which are representative of the workload. Then, based on those relationships, it assigns input tuples to the appropriate chunks. We evaluated our approach through experimentation in a Hadoop deployment on top of Grid5000 using standard benchmarks. The results show high reduction in data transfer during the shuffle phase compared to Native Hadoop.

[1]  Guanying Wang,et al.  A simulation approach to evaluating design decisions in MapReduce setups , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[2]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[3]  Carlo Curino,et al.  Schism , 2010, Proc. VLDB Endow..

[4]  Shashi Shekhar,et al.  Partitioning Similarity Graphs: A Framework for Declustering Problems , 1996, Inf. Syst..

[5]  Jin-Soo Kim,et al.  HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[6]  Tamara G. Kolda,et al.  Graph partitioning models for parallel computing , 2000, Parallel Comput..

[7]  Mohammad Hammoud,et al.  Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic , 2012, 2012 IEEE Fifth International Conference on Cloud Computing.

[8]  Hai Jin,et al.  LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud , 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[9]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[10]  Ling Liu,et al.  Purlieus: Locality-aware resource allocation for MapReduce in a cloud , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).