Similarity join for low- and high-dimensional data

The efficient processing of similarity joins is important for a large class of applications. The dimensionality of the data in these applications ranges from low to high. Most existing methods have focused on executing high-dimensional joins over large amounts of disk-based data. The increasing size of main memory on current computers, together with the need for efficient processing of spatial joins, suggests that spatial joins for a large class of problems can be processed in main memory. In this paper we develop two new spatial join algorithms, the Grid-join and the EGO*-join, and study their performance in comparison with the state-of-the-art EGO-join algorithm and the RSJ algorithm. Through evaluation we explore the domain of applicability of each algorithm and provide recommendations for the choice of join algorithm depending on the dimensionality of the data as well as on the critical ε parameter. We also point out the significance of the choice of this parameter for ensuring that the selectivity achieved is reasonable.
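For readers unfamiliar with the operation, the following is a minimal in-memory sketch of an ε-similarity join. It is not the Grid-join, EGO*-join, EGO-join, or RSJ algorithm from the paper. It assumes Euclidean distance and a uniform hash grid with cell side ε, and the names `epsilon_join`, `a`, `b`, and `eps` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def epsilon_join(a, b, eps):
    """Return all pairs (p, q), p in a and q in b, with Euclidean distance <= eps.

    Illustrative sketch only: a uniform grid with cell side eps is an
    assumption of this example, not the paper's method.
    """
    # Hash every point of b into the grid cell that contains it.
    grid = defaultdict(list)
    for q in b:
        grid[tuple(floor(x / eps) for x in q)].append(q)

    result = []
    for p in a:
        cell = tuple(floor(x / eps) for x in p)
        # With cell side eps, any match for p lies in p's own cell
        # or in one of the cells adjacent to it.
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for q in grid.get(neighbor, ()):
                if dist(p, q) <= eps:
                    result.append((p, q))
    return result
```

The selectivity of such a join can be estimated as `len(result) / (len(a) * len(b))`, which makes the effect of ε easy to observe on sample data. This sketch probes 3^d cells per point in d dimensions, so it is only practical when d is small.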