Dynamic Join Product Skew Handling for Hash-Joins in Shared-Nothing Database Systems

When data are uniformly distributed, parallel hash-based join algorithms scale up well. However, data skew can cause load imbalance among the processors and significantly degrade performance. In this paper we propose a dynamic skew handling algorithm that deals with this load imbalance by detecting and handling join product skew at run time. The idea is to monitor join processing during the join phase and compare the average processing rate of each partition with the rate statically predicted at the scheduling phase. If the difference is large enough to cause significant performance degradation, the processor is considered overloaded and a workload compensation strategy is invoked dynamically. In this case, based on the measured average processing rate, the amount of overload caused by the unpredicted join product skew is calculated, and the amount of load to be migrated to the non-overloaded processors is determined. We propose two methods, result redistribution and processing task migration, to handle the load migration from the overloaded processor to the non-overloaded processors. Simulation results show that our dynamic skew handling approach detects and handles load imbalances efficiently, so that rebalancing the load among the processors yields an almost constant join execution time under different join product skews.
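
The detection and compensation step described above can be illustrated with a minimal Python sketch. It is not the algorithm's actual formulation, which the abstract does not give. The names `ProcessorStatus`, `is_overloaded`, and `plan_migration`, the relative-deviation `threshold`, and the equal-finish-time rule for sizing the migrated load are all assumptions made for illustration. The sketch also assumes, as a simplification, that each processor's measured rate stays constant for the rest of the join phase.

```python
from dataclasses import dataclass


@dataclass
class ProcessorStatus:
    proc_id: int
    predicted_rate: float   # tuples/s predicted at the scheduling phase
    measured_rate: float    # average tuples/s observed during the join phase
    remaining_tuples: int   # input tuples this processor still has to join


def is_overloaded(p: ProcessorStatus, threshold: float = 0.2) -> bool:
    """Flag a processor whose measured rate falls short of the predicted
    rate by more than `threshold` (a hypothetical relative tolerance)."""
    return (p.predicted_rate - p.measured_rate) / p.predicted_rate > threshold


def plan_migration(procs: list[ProcessorStatus],
                   threshold: float = 0.2) -> dict[int, float]:
    """Return {proc_id: number of tuples to migrate away} for each
    overloaded processor.

    Assumed sizing rule: pick the common finish time that would result if
    all remaining work were spread in proportion to the measured rates,
    and migrate whatever an overloaded processor cannot finish by then.
    """
    overloaded = [p for p in procs if is_overloaded(p, threshold)]
    if not overloaded:
        return {}

    total_remaining = sum(p.remaining_tuples for p in procs)
    total_rate = sum(p.measured_rate for p in procs)
    target_time = total_remaining / total_rate

    plan: dict[int, float] = {}
    for p in overloaded:
        keep = p.measured_rate * target_time
        surplus = p.remaining_tuples - keep
        if surplus > 0:
            plan[p.proc_id] = surplus
    return plan
```

For example, with three processors predicted at 100 tuples/s, where one is measured at only 50 tuples/s and still holds 1000 tuples while the other two each hold 500, the sketch flags only the slow processor and proposes moving part of its remaining input to the others. Whether that load is moved as join results (result redistribution) or as unprocessed work (processing task migration) is outside the scope of this sketch.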