Learning automata-based algorithms for MapReduce data skewness handling

Abstract MapReduce is one of the most successful techniques for large-scale data processing. However, its performance degrades significantly when the data are skewed. Hash partitioning is the default in Big Data frameworks such as Hadoop and Spark. It works well when there is no data skewness, which is generally not the case for data generated by natural events. In this paper, we propose two new algorithms based on learning automata, the learning automata partitioner (LAP) and the traffic cost-aware partitioner (TCAP), for handling reducer-side data skewness in MapReduce applications. LAP is based on cluster combination and performs well when the degree of data skewness is low. TCAP, on the other hand, takes the network topology into account and balances network traffic cost in the shuffling phase. TCAP supports cluster splitting and performs well at any degree of data skewness. Both LAP and TCAP can also be used in heterogeneous environments. We evaluated the algorithms through several experiments and simulations using well-known benchmarks. The results confirm that our algorithms outperform other similar algorithms in most cases.
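
To illustrate why hash partitioning suffers under skew, the following minimal sketch assigns keys to reducers by hashing and compares the resulting reducer loads for a uniform and a Zipf-like key distribution. It is not the LAP or TCAP algorithm. The function names, the use of `zlib.crc32` as a stand-in hash, the choice of 16 reducers, and the synthetic key counts are all assumptions made for this example.

```python
import zlib

NUM_REDUCERS = 16  # assumed cluster size, chosen only for this example


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer by hashing the key (crc32 is a deterministic stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(key_counts: dict, num_reducers: int) -> list:
    """Sum the number of records each reducer receives under hash partitioning."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[hash_partition(key, num_reducers)] += count
    return loads


def imbalance(loads: list) -> float:
    """Ratio of the heaviest reducer load to the mean load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# Synthetic inputs (hypothetical): 1000 keys with equal counts vs. Zipf-like counts.
uniform_counts = {f"key{i}": 100 for i in range(1000)}
skewed_counts = {f"key{i}": max(1, 100000 // (i + 1)) for i in range(1000)}

uniform_loads = reducer_loads(uniform_counts, NUM_REDUCERS)
skewed_loads = reducer_loads(skewed_counts, NUM_REDUCERS)

print("uniform imbalance:", round(imbalance(uniform_loads), 2))
print("skewed imbalance: ", round(imbalance(skewed_loads), 2))
```

All records of a key go to the same reducer, so the reducer that receives the most frequent key carries at least that key's full record count, however evenly the remaining keys are spread. That is the reducer-side imbalance a skew-aware partitioner tries to reduce.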
