Paper information - A novel algorithm for handling reducer side data skew in MapReduce based on a learning automata game

Abstract: In many MapReduce applications, intermediate map outputs are distributed unevenly across the reducers. The partitioner determines the load on each reducer, and the completion time of a MapReduce job is determined by its slowest reduce task. Under normal conditions, assigning a large amount of data to a task increases the time it needs to complete. The current study presents an adaptive algorithm called LAHP (learning automata hash partitioner), which is based on a learning automata game and supports custom distribution of intermediate key-value pairs to reducers. In this algorithm, a learning automaton on every mapper node is set to control the load on the reducers, which leads to a learning automata game during the execution of a job. The algorithm can partition the intermediate key-value pairs arbitrarily, regardless of the statistical distribution of input data and pre-processing. Using the Bonett test at a 95% confidence level, the standard deviation ratio of hash to LAHP was [0.1, 2858], which means that LAHP showed much lower dispersion. The results show that the proposed algorithm can distribute any custom load to reducers with an accuracy of over 99% and can speed up the execution of popular applications more than four-fold.
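
The skew problem the abstract describes can be illustrated with a minimal sketch. This is not LAHP and not Hadoop code: it only mimics a hash-style partitioner (stable hash of the key modulo the number of reducers) on a hypothetical Zipf-like key distribution, and reports how far the most loaded reducer sits above the mean. The function names, the CRC32 hash, and the input sizes are all assumptions chosen for illustration.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Hash-style partitioner: stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(records, num_reducers):
    """Count how many intermediate key-value pairs each reducer receives."""
    loads = [0] * num_reducers
    for key, _value in records:
        loads[hash_partition(key, num_reducers)] += 1
    return loads


def skewed_records(num_keys, heaviest):
    """Hypothetical Zipf-like input: the key of rank k appears about heaviest / k times."""
    records = []
    for rank in range(1, num_keys + 1):
        count = max(1, heaviest // rank)
        records.extend((f"key{rank}", 1) for _ in range(count))
    return records


records = skewed_records(num_keys=50, heaviest=1000)
loads = reducer_loads(records, num_reducers=4)
mean_load = sum(loads) / len(loads)

# All pairs with the same key land on the same reducer, so the reducer that
# receives the heaviest key carries at least that key's full record count.
print("loads per reducer:", loads)
print("max / mean load:", round(max(loads) / mean_load, 2))
```

Because the job finishes only when the slowest reduce task does, the max-to-mean ratio printed here is a rough proxy for how much time is lost to skew under these assumptions.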

Amir Masoud Rahmani | Saeed Setayeshi | Mohammad Amin Irandoost
