Practical Skew Handling in Parallel Joins

We present an approach to handling skew in parallel joins in database systems. Our approach is easy to implement within current parallel DBMSs, and it performs well on skewed data without degrading performance on non-skewed data. The main idea is to use multiple algorithms, each specialized for a different degree of skew, and to use a small sample of the relations being joined to determine which algorithm is appropriate. We developed, implemented, and experimented with four new skew-handling parallel join algorithms. One of them, which we call virtual processor range partitioning, was the clear winner in high-skew cases, while traditional hybrid hash join was the clear winner in low-skew and no-skew cases. We present experimental results from an implementation of all four algorithms on the Gamma parallel database machine. To our knowledge, these are the first reported skew-handling numbers from an actual implementation.
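The sample-then-choose idea can be illustrated with a minimal sketch. It is not the implementation evaluated on Gamma: the skew test (`skew_threshold`), the function names, the quantile-based choice of split values, and the round-robin mapping of virtual processors to real processors are all assumptions made for illustration. In particular, the sketch does not split a single heavily repeated value across several processors.

```python
from bisect import bisect_right
from collections import Counter


def choose_join_algorithm(sample, num_processors, skew_threshold=1.0):
    """Pick a join strategy from a sample of join-attribute values.

    Hypothetical heuristic: treat the data as skewed when the most frequent
    sampled value alone exceeds `skew_threshold` times one processor's fair
    share (1 / num_processors) of the sample.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    top_count = Counter(sample).most_common(1)[0][1]
    top_fraction = top_count / len(sample)
    if top_fraction * num_processors > skew_threshold:
        return "virtual_processor_range_partitioning"
    return "hybrid_hash"


def virtual_processor_splitters(sample, num_virtual):
    """Return num_virtual - 1 split values taken at sample quantiles."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // num_virtual] for i in range(1, num_virtual)]


def assign_processor(value, splitters, num_processors):
    """Map a value to a virtual processor by range, then to a real
    processor round-robin (an assumed mapping policy)."""
    virtual = bisect_right(splitters, value)
    return virtual % num_processors


if __name__ == "__main__":
    skewed = [7] * 500 + list(range(500))
    print(choose_join_algorithm(skewed, num_processors=8))
    splitters = virtual_processor_splitters(skewed, num_virtual=32)
    print(assign_processor(250, splitters, num_processors=8))
```

Using many more virtual processors than real ones gives the range partitioning finer-grained pieces to spread over the machine, which is why the sketch takes `num_virtual` separately from `num_processors`.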
