Adaptive parallel hash join in main-memory databases

Presents an algorithm for parallel hash-join computation on main-memory databases that adapts to data skew, and its implementation on the IBM RP3 multiprocessor. The algorithm exploits the random-access capabilities of main-memory databases to detect and counteract skew on the fly. Data skew is detected at run time by monitoring the observed frequencies of the values of the join attribute and applying to them a threshold function that takes account of the distribution of workload among processors. If and when this threshold is reached for certain values of the join attribute, the computation corresponding to those values is fragmented among an appropriate number of processors. Fragmentation requires some replication of input tuples, which modestly increases the total workload, but it reduces the completion time significantly by reducing the workload at the overloaded processor. A simplified analysis is supplemented by experiments. The description and analysis of the algorithm are based on the shared-nothing model. The implementation uses hierarchical shared memory providing non-uniform memory access.
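
The detect-and-fragment idea described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the threshold function or the fragmentation scheme, so everything below is an assumption made for illustration. In particular, the threshold (a hypothetical `factor` times the fair share of tuples per processor), the function names `detect_skewed_values` and `partition`, and the choice to split one input round-robin while replicating the other are all hypothetical. The sketch also counts frequencies in one pass up front, whereas the paper detects skew on the fly during the join.

```python
import math
from collections import Counter


def detect_skewed_values(join_keys, num_procs, factor=2.0):
    """Map each skewed join value to the number of processors to spread it over.

    Assumed threshold (not from the paper): a value is skewed when its
    frequency exceeds factor * (tuples seen / num_procs).
    """
    counts = Counter(join_keys)
    fair_share = len(join_keys) / num_procs
    threshold = factor * fair_share
    skewed = {}
    for value, count in counts.items():
        if count > threshold:
            # Enough fragments to bring each one near the fair share.
            skewed[value] = min(num_procs, math.ceil(count / fair_share))
    return skewed


def partition(r_keys, s_keys, num_procs, factor=2.0):
    """Hash-partition both inputs, fragmenting skewed values of R.

    Tuples of R with a skewed value are dealt round-robin over several
    processors; matching tuples of S are replicated to each of them, so
    every (r, s) pair still meets on exactly one processor.
    """
    skewed = detect_skewed_values(r_keys, num_procs, factor)
    r_parts = [[] for _ in range(num_procs)]
    s_parts = [[] for _ in range(num_procs)]
    next_slot = Counter()  # round-robin position per skewed value

    for key in r_keys:
        home = hash(key) % num_procs
        if key in skewed:
            offset = next_slot[key] % skewed[key]
            next_slot[key] += 1
            r_parts[(home + offset) % num_procs].append(key)
        else:
            r_parts[home].append(key)

    for key in s_keys:
        home = hash(key) % num_procs
        # Replicate to every fragment; non-skewed values have one fragment.
        for offset in range(skewed.get(key, 1)):
            s_parts[(home + offset) % num_procs].append(key)

    return r_parts, s_parts
```

In this sketch the replication of S tuples is the extra workload, and the payoff is that no single processor holds all R tuples of the heavy value. Only join keys are carried to keep the example short; a real implementation would move whole tuples.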