A novel algorithm for handling reducer side data skew in MapReduce based on a learning automata game

Abstract In many MapReduce applications, the intermediate map outputs are distributed unevenly across the reducers. The partitioner determines the load on each reducer, and the completion time of a MapReduce job is determined by its slowest reduce task. Under normal conditions, assigning a very large amount of data to a single task increases the time that task needs to complete, and therefore the time of the whole job. The current study presents an adaptive algorithm called LAHP (learning automata hash partitioner), which is based on a learning automata game and distributes intermediate key-value pairs to reducers according to a custom load. In this algorithm, a learning automaton on every mapper node controls the load on the reducers, so the automata play a game with one another during the execution of a job. The algorithm can partition the intermediate key-value pairs into any desired load distribution, regardless of the statistical distribution of the input data and without pre-processing. Using the Bonett test at a confidence level of 95%, the hash-to-LAHP ratio of standard deviations was [0.1, 2858], which means that LAHP showed much lower dispersion. The results show that the proposed algorithm can distribute any custom load to the reducers with an accuracy of over 99% and can speed up the execution of popular applications more than four-fold.
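
The idea of an automaton that steers records toward a target reducer load can be illustrated with a minimal sketch. It is not the LAHP algorithm itself: the abstract does not specify the update rule, so the code assumes a textbook linear reward-penalty automaton, a single mapper, and a hypothetical feedback signal (reward when the chosen reducer is at or below its target share so far). The names `LinearRewardPenaltyAutomaton`, `simulate_partitioning`, and `target_shares` are illustrative only.

```python
import random


class LinearRewardPenaltyAutomaton:
    """Textbook linear reward-penalty automaton (assumed scheme, not LAHP's)."""

    def __init__(self, num_actions, reward_rate=0.05, penalty_rate=0.05, seed=None):
        if num_actions < 2:
            raise ValueError("need at least two actions (reducers)")
        self.probabilities = [1.0 / num_actions] * num_actions
        self.reward_rate = reward_rate
        self.penalty_rate = penalty_rate
        self._rng = random.Random(seed)

    def choose(self):
        # Sample an action (reducer index) from the current probability vector.
        actions = range(len(self.probabilities))
        return self._rng.choices(actions, weights=self.probabilities)[0]

    def reward(self, action):
        # Move probability mass toward the rewarded action.
        a = self.reward_rate
        for i, p in enumerate(self.probabilities):
            if i == action:
                self.probabilities[i] = p + a * (1.0 - p)
            else:
                self.probabilities[i] = (1.0 - a) * p

    def penalize(self, action):
        # Move probability mass away from the penalized action,
        # spreading it evenly over the other actions.
        b = self.penalty_rate
        r = len(self.probabilities)
        for i, p in enumerate(self.probabilities):
            if i == action:
                self.probabilities[i] = (1.0 - b) * p
            else:
                self.probabilities[i] = b / (r - 1) + (1.0 - b) * p


def simulate_partitioning(target_shares, num_records, seed=0):
    """Assign records one by one; feedback is a hypothetical load check."""
    automaton = LinearRewardPenaltyAutomaton(len(target_shares), seed=seed)
    loads = [0] * len(target_shares)
    for n in range(1, num_records + 1):
        reducer = automaton.choose()
        loads[reducer] += 1
        # Assumed feedback: reward if this reducer's share so far
        # does not exceed its target share, otherwise penalize.
        if loads[reducer] / n <= target_shares[reducer]:
            automaton.reward(reducer)
        else:
            automaton.penalize(reducer)
    return loads, automaton.probabilities


if __name__ == "__main__":
    loads, probs = simulate_partitioning([0.5, 0.3, 0.2], num_records=10_000)
    print("loads:", loads)
    print("final action probabilities:", [round(p, 3) for p in probs])
```

Both update rules keep the probability vector summing to one, which is the property that lets the automaton keep sampling reducers as the feedback changes. In the setting the abstract describes, one such automaton would run on every mapper node and the automata would interact through the shared reducer load; the sketch above leaves that coordination out.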
