Skew handling techniques in sort-merge join

Joins are among the most frequently executed operations. Several fast join algorithms have been developed and extensively studied; they fall into three categories: sort-merge, hash-based, and index-based. All three perform well on most data, but techniques for limiting the performance degradation caused by skew have been investigated only for hash-based algorithms. For sort-merge join, however, even the small amount of skew present in realistic data can cause a significant performance hit on a commercial DBMS. This paper examines the negative effects of skew on sort-merge join and proposes several refinements that handle data skew effectively. Experiments show that some of these algorithms also impose virtually no penalty in the absence of data skew and are thus suitable replacements for existing sort-merge implementations. We also show that these refinements significantly improve the performance of sort-merge band join.
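To illustrate why skew matters in the merge phase, the following is a minimal in-memory sketch of a textbook sort-merge equijoin. It is not one of the refinements proposed in the paper. The function name `sort_merge_join` and the `rescans` counter are hypothetical, introduced here only to make the cost of duplicate join values visible: each additional left tuple with a repeated key must revisit the entire run of matching right tuples.

```python
def sort_merge_join(left, right):
    """Textbook sort-merge equijoin over lists of (key, payload) tuples.

    Returns (rows, rescans). `rescans` is an illustrative counter (an
    assumption of this sketch): the number of right-side tuples that are
    visited again because the left side repeats a join key.
    """
    left = sorted(left, key=lambda t: t[0])
    right = sorted(right, key=lambda t: t[0])
    rows = []
    rescans = 0
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of right tuples that share this key.
            run_start = j
            run_end = j
            while run_end < len(right) and right[run_end][0] == lk:
                run_end += 1
            # Each left tuple with this key is paired with the whole run.
            first = True
            while i < len(left) and left[i][0] == lk:
                for k in range(run_start, run_end):
                    rows.append((lk, left[i][1], right[k][1]))
                if not first:
                    # Backing up: the run is traversed again.
                    rescans += run_end - run_start
                first = False
                i += 1
            j = run_end
    return rows, rescans


# Unique keys: the merge is a single forward pass, nothing is revisited.
uniform_rows, uniform_rescans = sort_merge_join(
    [(k, "l") for k in range(4)], [(k, "r") for k in range(4)]
)

# Skewed keys: key 1 appears three times on the left and twice on the right.
skewed_rows, skewed_rescans = sort_merge_join(
    [(1, "a"), (1, "b"), (1, "c"), (2, "d")],
    [(1, "x"), (1, "y"), (2, "z")],
)
```

In this sketch the revisits are cheap because everything is in memory. In a disk-based system, a run of duplicates that does not fit in the available buffer space may have to be read again for each matching tuple, which is one plausible source of the degradation the abstract describes.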
