Paper Information - Improving Hash Join Performance By Exploiting Intrinsic Data Skew

Large relational databases are part of everyday life: governments rely on them, and almost every store uses them to process purchases. Real-world data sets are not uniformly distributed and often contain significant skew. In commercial databases, for example, some items are purchased far more often than others. A relational database must be able to find related information efficiently, and in large databases the most common method for doing so is a hash join algorithm. Although mitigating the negative effects of skew on hash joins has been studied, no prior work has examined how the statistics present in modern database systems allow skew to be exploited to improve hash join performance. This thesis presents Histojoin, a join algorithm that uses statistics to identify data skew and improve the performance of hash join operations. Experimental results show that, for skewed data sets, Histojoin performs significantly fewer I/O operations and is 10 to 60% faster than standard hash join algorithms.
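The intuition behind exploiting skew can be illustrated with a toy model. The sketch below is not the thesis's algorithm, whose details the abstract does not give. It assumes a memory budget of `capacity` build-side keys and a precomputed frequency histogram of the probe side, and it compares how many probe tuples would have to be spilled to disk when the memory-resident keys are chosen by hash partition versus by histogram frequency. All function names, the `k % partitions` stand-in for a hash function, and the data are hypothetical.

```python
from collections import Counter

def spilled_probe_tuples(probe_keys, resident_keys):
    """Count probe tuples whose key is not memory-resident (they would go to disk)."""
    return sum(1 for k in probe_keys if k not in resident_keys)

def resident_by_hash(build_keys, capacity):
    """Baseline: keep the build keys that land in partition 0.
    Integer keys and `k % partitions` stand in for a real hash function."""
    partitions = max(1, -(-len(build_keys) // capacity))  # ceiling division
    return {k for k in build_keys if k % partitions == 0}

def resident_by_histogram(build_keys, histogram, capacity):
    """Skew-aware: keep the build keys that occur most often on the probe side."""
    ranked = sorted(build_keys, key=lambda k: histogram.get(k, 0), reverse=True)
    return set(ranked[:capacity])

# Hypothetical data: 100 distinct build keys and a skewed probe side in which
# keys 7 and 13 dominate, while every key also appears twice as background.
build_keys = list(range(100))
probe_keys = [7] * 500 + [13] * 300 + [k for k in range(100) for _ in range(2)]
histogram = Counter(probe_keys)  # assumed to be available from database statistics
capacity = 10                    # assumed memory budget, in build keys

baseline = spilled_probe_tuples(probe_keys, resident_by_hash(build_keys, capacity))
skew_aware = spilled_probe_tuples(
    probe_keys, resident_by_histogram(build_keys, histogram, capacity)
)
print(f"probe tuples spilled, hash-chosen residents: {baseline}")
print(f"probe tuples spilled, histogram-chosen residents: {skew_aware}")
```

With this data, choosing residents by frequency keeps the two dominant keys in memory, so far fewer probe tuples need a disk round trip. Real systems would work with partitions and histogram buckets rather than individual keys, which this sketch ignores.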

Bryce Cutt
