Flexible partitioning for selective binary theta-joins in a massively parallel setting

Efficient join processing plays an important role in big data analysis. In this work, we focus on generic theta-joins in massively parallel environments such as MapReduce and Spark. Theta-joins are notoriously slow because of their inherent quadratic complexity, even when their selectivity is low, e.g., 1%. The main performance bottleneck differs from case to case and stems from one or more of the following factors: the amount of data being shuffled, the memory load on reducers, or the computation load on reducers. We propose an ensemble-based partitioning approach that tackles all three aspects. In this way, we save communication cost, better respect the memory and computation limitations of reducers, and, overall, reduce the total execution time. The key idea behind our partitioning is to cluster join key values using two techniques, namely matrix re-arrangement and agglomerative clustering, which can run either in isolation or in combination. We present thorough experimental results using both band queries on real data and arbitrary synthetic predicates. We show that we can save up to 45% of the communication cost and reduce the computation load of a single reducer by up to 50% in band queries, whereas the savings are up to 74% and 80%, respectively, in queries with arbitrary theta predicates. Apart from being effective, our approach has potential benefits that can be estimated before execution from metadata, which allows for informed partitioning decisions. Finally, our solutions are flexible in that they can account for any weighted combination of the three bottleneck factors.
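To make the three bottleneck factors concrete, the following minimal sketch scores a candidate partitioning of a band join over a small join matrix. It is an illustration under stated assumptions, not the method evaluated in this work: the function names (`band_join_matrix`, `covers`, `partition_cost`), the rectangular shape of the reducer regions, and the particular definitions of shuffle volume, reducer input, and reducer load are all hypothetical choices made for the example.

```python
def band_join_matrix(r_keys, s_keys, eps):
    """Cell (i, j) is True when r_keys[i] and s_keys[j] satisfy |r - s| <= eps."""
    return [[abs(r - s) <= eps for s in s_keys] for r in r_keys]


def covers(matrix, groups):
    """Check that every candidate cell is assigned to at least one reducer."""
    return all(
        any(i in rows and j in cols for rows, cols in groups)
        for i, row in enumerate(matrix)
        for j, cell in enumerate(row)
        if cell
    )


def partition_cost(matrix, r_counts, s_counts, groups, weights=(1.0, 1.0, 1.0)):
    """Score a partitioning given as one (row_indices, col_indices) pair per reducer.

    Assumed cost definitions (hypothetical, for illustration only):
      shuffle   = total number of tuples sent to all reducers
      max_input = largest number of tuples held by a single reducer
      max_load  = largest number of candidate pairs checked by a single reducer
    """
    inputs, loads = [], []
    for rows, cols in groups:
        inputs.append(sum(r_counts[i] for i in rows) + sum(s_counts[j] for j in cols))
        loads.append(
            sum(r_counts[i] * s_counts[j] for i in rows for j in cols if matrix[i][j])
        )
    shuffle, max_input, max_load = sum(inputs), max(inputs), max(loads)
    w_shuffle, w_memory, w_compute = weights
    score = w_shuffle * shuffle + w_memory * max_input + w_compute * max_load
    return {"shuffle": shuffle, "max_input": max_input, "max_load": max_load, "score": score}


# Four distinct key values per relation, one tuple per key, band width 1.
keys = [1, 2, 3, 4]
counts = [1, 1, 1, 1]
matrix = band_join_matrix(keys, keys, eps=1)

one_reducer = [([0, 1, 2, 3], [0, 1, 2, 3])]
two_reducers = [([0, 1], [0, 1, 2]), ([2, 3], [1, 2, 3])]

for name, groups in (("one reducer", one_reducer), ("two reducers", two_reducers)):
    assert covers(matrix, groups)
    print(name, partition_cost(matrix, counts, counts, groups))
```

In this toy setting, splitting the work across two reducers replicates some tuples and so raises the shuffle volume, while lowering the per-reducer input and load. Changing `weights` shifts which of the two partitionings scores better, which mirrors the idea of optimizing for a weighted combination of the three factors.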
