MapReduce Data Skewness Handling: A Systematic Literature Review

MapReduce is one of the most successful programming techniques for large-scale, data-intensive computation. It follows a divide-and-conquer approach that uses commodity computers, known as nodes, for parallel processing. The scalability and performance of this technique depend closely on how data is distributed across map and reduce tasks. For many reasons, such as node failure, network failure, and data skewness, one task may take longer to complete than the others, and because job completion time is determined by the slowest task, a single slow task delays the whole job. Data skewness is one of the most important causes of such slow tasks. Although MapReduce is widely used because of its high flexibility and fault tolerance, in the presence of data skewness it cannot fully utilize the nodes for parallel processing. The objectives of this study were to review related articles, classify them by the type of problem addressed, and determine their advantages and disadvantages. Open issues were also defined to provide guidelines for future research on this subject. To achieve these objectives, research questions were defined and answered. The review concluded that important parameters have not been considered in existing MapReduce data skewness handling approaches.
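The effect of skewed keys on reduce-side load can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed approaches. The CRC32-based `partition` function stands in for a framework's default hash partitioner, and the workload (one hot key plus many rare keys), the function names, and the unit cost per record are all assumptions made for illustration.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reducer by hashing (illustrative stand-in, not Hadoop's code)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(records, num_reducers):
    """Count how many intermediate records each reducer receives."""
    loads = [0] * num_reducers
    for key, _value in records:
        loads[partition(key, num_reducers)] += 1
    return loads


def job_completion_time(loads, cost_per_record=1.0):
    """The reduce phase finishes only when the most heavily loaded reducer finishes."""
    return max(loads) * cost_per_record


def skew_ratio(loads):
    """Ratio of the heaviest reducer load to the mean load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical skewed workload: one hot key with 9000 records, 1000 keys with one record each.
records = [("hot", 1)] * 9000 + [(f"k{i}", 1) for i in range(1000)]
loads = reducer_loads(records, num_reducers=4)

print("records per reducer:", loads)
print("completion time (record units):", job_completion_time(loads))
print("skew ratio:", round(skew_ratio(loads), 2))
```

Because every record with the same key must go to the same reducer, the reducer that receives the hot key processes at least 9000 of the 10000 records, so adding more reducers does not shorten the job under this partitioning scheme.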
