Improving Hash Join Performance By Exploiting Intrinsic Data Skew

Large relational databases are part of everyday life: governments rely on them, and almost every store uses one to process purchases. Real-world data sets are rarely uniformly distributed and often contain significant skew. In commercial databases, for example, some items are purchased far more often than others. A relational database must be able to efficiently find related information among the data it stores, and in large databases the most common method for doing so is a hash join algorithm. Although mitigating the negative effects of skew on hash joins has been studied, no prior work has examined how the statistics already maintained by modern database systems can be used to exploit skew and turn it into an advantage for hash join performance. This thesis presents Histojoin: a join algorithm that uses these statistics to identify data skew and improve the performance of hash join operations. Experimental results show that, for skewed data sets, Histojoin performs significantly fewer I/O operations and is 10% to 60% faster than standard hash join algorithms.
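
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way statistics could let a hash join exploit skew, not the Histojoin implementation itself. It assumes the statistics reduce to a set of the most frequent join-key values on the probe side. Build tuples with those "hot" keys stay in an in-memory table, so the many probe tuples that match them are joined immediately and never written to a partition. The names `top_keys` and `skew_aware_hash_join`, and the use of Python lists as stand-ins for on-disk partitions, are hypothetical.

```python
from collections import Counter, defaultdict


def top_keys(probe_keys, n):
    """Stand-in for optimizer statistics: the n most frequent probe-side keys."""
    return {key for key, _ in Counter(probe_keys).most_common(n)}


def skew_aware_hash_join(build, probe, hot_keys, num_partitions=4):
    """Join (key, payload) tuples; returns (results, spilled probe tuple count).

    Assumption: hot_keys is small enough that their build tuples fit in memory.
    """
    hot_table = defaultdict(list)                       # kept in memory
    build_parts = [[] for _ in range(num_partitions)]   # stand-in for disk
    for key, payload in build:
        if key in hot_keys:
            hot_table[key].append(payload)
        else:
            build_parts[hash(key) % num_partitions].append((key, payload))

    results = []
    probe_parts = [[] for _ in range(num_partitions)]   # stand-in for disk
    spilled = 0
    for key, payload in probe:
        if key in hot_keys:
            # Frequent keys are joined right away, with no partition I/O.
            for build_payload in hot_table.get(key, []):
                results.append((key, build_payload, payload))
        else:
            probe_parts[hash(key) % num_partitions].append((key, payload))
            spilled += 1

    # Conventional partition-by-partition join for the remaining tuples.
    for build_part, probe_part in zip(build_parts, probe_parts):
        table = defaultdict(list)
        for key, payload in build_part:
            table[key].append(payload)
        for key, payload in probe_part:
            for build_payload in table.get(key, []):
                results.append((key, build_payload, payload))
    return results, spilled
```

In this sketch the `spilled` counter is a rough proxy for probe-side I/O: the more skewed the probe relation, the larger the share of its tuples that match a hot key and are never partitioned.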
