Four Types of Data Skew and Their Effect on Parallel Join Performance

Recent work on parallel joins and data skew has concentrated on algorithm design without considering the causes and characteristics of data skew itself. This paper presents a simple analytic model of data skew and identifies four distinct types: tuple population skew, selectivity skew, hash partition skew, and join probability skew. To demonstrate the model, we analyze a representative algorithm, the GRACE parallel join algorithm. The analysis indicates that skew effects are substantial and that they vary greatly with the type of skew. They also vary substantially with system and data characteristics such as communication speed, cardinality, and selectivity.
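
As a rough illustration of one of the four types, hash partition skew, the following Python sketch counts how many tuples each node receives when join keys are hash-partitioned, and reports the ratio of the largest partition to the mean. It is not the paper's analytic model. The function names, the modulo hash, the four-node setup, and the synthetic key distributions are all assumptions made for this example.

```python
def partition_sizes(keys, num_nodes):
    """Count the tuples each node receives under hash partitioning.

    Assumption: integer keys and a plain modulo as a stand-in hash function.
    """
    sizes = [0] * num_nodes
    for key in keys:
        sizes[key % num_nodes] += 1
    return sizes


def skew_factor(sizes):
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical data: 1000 tuples with distinct keys vs. 1000 tuples
# in which half share a single hot key value.
uniform_keys = list(range(1000))
skewed_keys = [7] * 500 + list(range(500))

uniform_sizes = partition_sizes(uniform_keys, num_nodes=4)
skewed_sizes = partition_sizes(skewed_keys, num_nodes=4)

print(uniform_sizes, skew_factor(uniform_sizes))
print(skewed_sizes, skew_factor(skewed_sizes))
```

In this toy setup the hot key sends a disproportionate share of the tuples to a single node. Since a parallel join cannot finish before its most heavily loaded node does, a ratio well above 1.0 suggests that the added parallelism is being partly wasted.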