Faster Joins, Self-Joins and Multi-Way Joins Using Join Indices

We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called "Jive-join" developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a set of temporary files that contain tuple identifiers but no input tuples. Stripe-join performs this efficiently even when the input relations are much larger than main memory, as long as the number of blocks in main memory is of the order of the square root of the number of blocks in the participating relations. Stripe-join is particularly efficient for self-joins. To our knowledge, Stripe-join is the first algorithm that, given a join index and a relation significantly larger than main memory, can perform a self-join with just a single pass over the input relation and without storing input tuples in intermediate files. Almost all the I/O is sequential, thus minimizing the impact of seek and rotational latency. The algorithm is resistant to data skew. It can also join multiple relations while still making only a single pass over each input relation. Using a detailed cost model, Stripe-join is analyzed and compared with competing algorithms. For large input relations, Stripe-join performs significantly better than Valduriez's algorithm and hash join algorithms. We demonstrate circumstances under which Stripe-join performs significantly better than Jive-join. Unlike Jive-join, Stripe-join makes no assumptions about the order of the join index.
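
To make the setting concrete, the following minimal Python sketch illustrates what a join index is and the memory condition stated above. It is not Stripe-join: it is a naive in-memory join that ignores I/O entirely. The function names `join_with_index` and `memory_looks_sufficient`, the use of list positions as tuple identifiers, and the `constant` parameter are all assumptions introduced for illustration.

```python
import math


def join_with_index(r, s, join_index):
    """Naive in-memory join driven by a join index (NOT Stripe-join).

    r, s: lists of tuples; a tuple's list position stands in for its
          tuple identifier (an assumption made for this sketch).
    join_index: iterable of (r_id, s_id) pairs, each naming one pair of
          matching tuples. No ordering of the pairs is assumed.
    """
    return [r[r_id] + s[s_id] for r_id, s_id in join_index]


def memory_looks_sufficient(memory_blocks, relation_blocks, constant=1.0):
    """Check the square-root condition from the text.

    Returns True when memory_blocks is at least constant * sqrt(relation_blocks).
    The constant is hypothetical; the text only says "of the order of".
    """
    return memory_blocks >= constant * math.sqrt(relation_blocks)


# Two small relations and a join index listing the matching pairs.
r = [("a", 1), ("b", 2)]
s = [(1, "x"), (2, "y"), (2, "z")]
join_index = [(0, 0), (1, 1), (1, 2)]
joined = join_with_index(r, s, join_index)

# A self-join uses the same relation on both sides of the index.
self_joined = join_with_index(r, r, [(0, 1)])
```

Under these assumptions, a relation occupying 1,000,000 blocks would call for roughly 1,000 blocks of main memory, which is the sense in which the inputs can be much larger than memory.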