In-Memory Spatial Join: The Data Matters!

A spatial join computes all pairs of spatial objects in two datasets that satisfy a distance constraint. Increasing demand from applications ranging from human brain analysis to transportation data analysis motivates the design of new in-memory spatial join algorithms. Among recent proposals, the following six algorithms can efficiently perform in-memory spatial joins: Size Separation Spatial Join (S3), Spatial Grid Hash join (SGrid), TOUCH, Partition Based Spatial-Merge Join (PBSM), Plane-Sweep Join (PS), and Nested-Loop Join (NL). This paper addresses the need for studies of aspects that may influence the performance of spatial join algorithms. In particular, given two datasets, A and B, the following aspects may affect performance: whether the datasets are real or synthetic, the distributions of the datasets with respect to density and location, and the order in which the spatial join is performed (A ⋈ B or B ⋈ A). To study the effects of these aspects on performance, we implement the six spatial join algorithms in a single framework and conduct extensive experiments. The findings show that whether the data is real or synthetic, the data distribution, and the join order can each substantially influence the performance of the algorithms. We present detailed findings that offer insight into different facets of each algorithm and that enable comparison across algorithms and datasets. Furthermore, we provide advice on choosing among the spatial join algorithms based on the empirical evaluation.
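
As a minimal illustration of the join definition above, the following Python sketch performs a nested-loop distance join, the simplest of the six approaches in spirit. It is not the implementation evaluated in this paper: the function name `nested_loop_join`, the parameter `eps`, the use of 2D points rather than general spatial objects, and the sample data are all assumptions made for the example.

```python
import math

def nested_loop_join(a, b, eps):
    """Return index pairs (i, j) such that dist(a[i], b[j]) <= eps.

    Assumption: objects are points given as coordinate tuples, and the
    distance constraint is Euclidean distance at most eps.
    """
    pairs = []
    for i, p in enumerate(a):          # outer loop over the first dataset
        for j, q in enumerate(b):      # inner loop over the second dataset
            if math.dist(p, q) <= eps:
                pairs.append((i, j))
    return pairs

# Hypothetical sample data, chosen only to make the result easy to check.
A = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
B = [(20.0, 20.5), (0.0, 1.0)]

ab = nested_loop_join(A, B, eps=1.0)   # A join B
ba = nested_loop_join(B, A, eps=1.0)   # B join A
```

In this sketch both join orders yield the same pairs with the indices swapped, so the order affects only which dataset drives the outer loop, not the result.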