When data are uniformly distributed, parallel hash-based join algorithms scale up well. However, data skew can cause load imbalance among the processors and significantly degrade performance. In this paper we propose a dynamic skew handling algorithm that addresses this load imbalance by detecting and handling join product skew at run time. The idea is to monitor processing during the join phase and compare the average processing rate of each partition with the rate statically predicted during the scheduling phase. If the difference is large enough to cause significant performance degradation, the processor is considered overloaded and a workload compensation strategy is invoked dynamically. In this case, the measured average processing rate is used to calculate the amount of overload caused by the unpredicted join product skew and to determine the amount of load to be migrated to the non-overloaded processors. We propose two methods, result redistribution and processing task migration, to handle load migration from the overloaded processor to the non-overloaded processors. Simulation results show that our dynamic skew handling approach detects and handles load imbalances efficiently, so that rebalancing the load among the processors yields an almost constant join execution time under different degrees of join product skew.
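The detection and compensation step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract gives no formulas, so the relative-shortfall test, the `threshold` value, the definition of overload as the work that cannot finish within the predicted time, and the even split among non-overloaded processors are all assumptions, as are the names `ProcessorStatus`, `is_overloaded`, and `plan_migration`.

```python
from dataclasses import dataclass


@dataclass
class ProcessorStatus:
    """Hypothetical per-processor snapshot taken during the join phase."""
    name: str
    remaining_tuples: int   # tuples still to be joined on this processor
    predicted_rate: float   # tuples/sec predicted at scheduling time (> 0)
    measured_rate: float    # average tuples/sec observed so far


def is_overloaded(p: ProcessorStatus, threshold: float = 0.2) -> bool:
    """Assumed test: the measured rate falls short of the predicted rate
    by more than `threshold` (a relative fraction, chosen arbitrarily)."""
    shortfall = (p.predicted_rate - p.measured_rate) / p.predicted_rate
    return shortfall > threshold


def plan_migration(processors, threshold: float = 0.2):
    """Return {overloaded_name: {target_name: tuples_to_migrate}}.

    Assumed policy: an overloaded processor keeps only the work it can
    finish, at its measured rate, within the originally predicted time;
    the excess is split evenly among the non-overloaded processors.
    """
    overloaded = [p for p in processors if is_overloaded(p, threshold)]
    targets = [p for p in processors if not is_overloaded(p, threshold)]
    plan = {}
    if not targets:
        return plan  # nowhere to migrate load
    for p in overloaded:
        predicted_time = p.remaining_tuples / p.predicted_rate
        keep = int(p.measured_rate * predicted_time)
        excess = p.remaining_tuples - keep
        share, extra = divmod(excess, len(targets))
        # Hand out the remainder one tuple at a time to the first targets.
        plan[p.name] = {
            t.name: share + (1 if i < extra else 0)
            for i, t in enumerate(targets)
        }
    return plan
```

In this sketch, a processor predicted to run at 100 tuples/sec but measured at 50 tuples/sec with 1000 tuples remaining would keep 500 tuples and hand the other 500 to its peers. Whether the migrated unit is a processing task or a share of the join result is exactly the choice between the two methods named in the abstract, and the sketch does not model that distinction.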