Handling data skew in join algorithms using MapReduce

We introduce a skew handling algorithm called multi-dimensional range partitioning. The proposed algorithm is more efficient than traditional MapReduce-based join algorithms. The proposed algorithm is scalable regardless of the size of the input data.

One of the major obstacles hindering effective join processing on MapReduce is data skew. Since MapReduce's basic hash-based partitioning method cannot solve the problem properly, two alternatives have been proposed: range-based and randomized methods. However, both still have drawbacks: the range-based method does not handle join product skew, and the randomized method performs worse than basic hash-based partitioning when the input relations are not skewed. In this paper, we present a new skew handling method, called multi-dimensional range partitioning (MDRP). The proposed method overcomes the limitations of traditional algorithms in two ways: 1) the number of output records expected at each machine is considered, which leads to better handling of join product skew, and 2) a small number of input records are sampled before the actual join begins so that an efficient execution plan considering the degree of data skew can be created. In a scalar skew experiment, the proposed join algorithm is about 6.76 times faster than the range-based algorithm when join product skew exists and about 5.14 times faster than the randomized algorithm when the input relations are not skewed. Moreover, a worst-case analysis shows that the input and output imbalances are less than or equal to 2. The proposed algorithm does not require any modification to the original MapReduce environment and is applicable to complex join operations such as theta-joins and multi-way joins.
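To make the distinction between input skew and join product skew concrete, the following minimal Python sketch computes per-reducer loads for an equi-join under plain hash-style partitioning. It is an illustration only and not the MDRP algorithm: the names `reducer_loads` and `imbalance`, the toy data, and the definition of imbalance as maximum load divided by mean load are all assumptions made for this example, and the paper's own definition of imbalance may differ.

```python
from collections import Counter


def reducer_loads(r_keys, s_keys, num_reducers, partition):
    """Return (input_load, output_load), one entry per reducer.

    `partition` maps a join key to a reducer index (hypothetical helper).
    """
    r_counts, s_counts = Counter(r_keys), Counter(s_keys)
    input_load = [0] * num_reducers
    output_load = [0] * num_reducers
    for key in set(r_counts) | set(s_counts):
        p = partition(key)
        # Input load: every record with this key is shipped to reducer p.
        input_load[p] += r_counts[key] + s_counts[key]
        # Output load: an equi-join emits |R_k| * |S_k| records for key k.
        output_load[p] += r_counts[key] * s_counts[key]
    return input_load, output_load


def imbalance(load):
    """Assumed metric: maximum load over mean load (1.0 is balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 1.0


# Toy data: key 0 is heavy in both relations, keys 1..7 appear once each.
r_keys = [0] * 8 + list(range(1, 8))
s_keys = [0] * 8 + list(range(1, 8))
num_reducers = 4

inp, out = reducer_loads(
    r_keys, s_keys, num_reducers, lambda k: k % num_reducers
)
print(inp)  # [18, 4, 4, 4]
print(out)  # [65, 2, 2, 2]
print(imbalance(inp), imbalance(out))
```

In this toy case the reducer that receives the heavy key carries 18 of 30 input records but 65 of 71 output records, so the output imbalance is noticeably larger than the input imbalance. This is the situation in which a partitioning scheme that balances only input sizes can still leave one machine doing most of the join work.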
