Fast similarity join for multi-dimensional data

The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper, we develop two new in-memory spatial join algorithms, the Grid-join and EGO*-join, and study their performance. Through evaluation, we explore the domain of applicability of each approach and provide recommendations for the choice of a join algorithm depending upon the dimensionality of the data as well as the expected selectivity of the join. We show that the two new proposed join techniques substantially outperform the state-of-the-art join algorithm, the EGO-join.

[1]  Kenneth A. Ross,et al.  Making B+-Trees Cache Conscious in Main Memory , 2000, SIGMOD Conference.

[2]  Christian Böhm,et al.  A cost model and index architecture for the similarity join , 2001, Proceedings 17th International Conference on Data Engineering.

[3]  Sunil Prabhakar,et al.  Similarity join for low-and high-dimensional data , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[4]  Walid G. Aref,et al.  Efficient Evaluation of Continuous Range Queries on Moving Objects , 2002, DEXA.

[5]  Hans-Peter Kriegel,et al.  Using sets of feature vectors for similarity search on voxelized CAD objects , 2003, SIGMOD '03.

[6]  Ming-Ling Lo,et al.  Spatial hash-joins , 1996, SIGMOD '96.

[7]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[8]  Jennifer Widom,et al.  Proceedings of the 1996 ACM SIGMOD international conference on Management of data , 1996, PODS 1996.

[9]  David J. DeWitt,et al.  Partition based spatial-merge join , 1996, SIGMOD '96.

[10]  Christos Faloutsos,et al.  QBIC project: querying images by content, using color, texture, and shape , 1993, Electronic Imaging.

[11]  Jiawei Han,et al.  Discovery of Spatial Association Rules in Geographic Information Databases , 1995, SSD.

[12]  Julius T. Tou,et al.  Information Systems , 1973, GI Jahrestagung.

[13]  Özgür Ulusoy,et al.  A Quadtree-Based Dynamic Attribute Indexing Method , 1998, Comput. J..

[14]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[15]  Susanne E. Hambrusch,et al.  Main Memory Evaluation of Monitoring Queries Over Moving Objects , 2004, Distributed and Parallel Databases.

[16]  Elke A. Rundensteiner,et al.  Spatial Joins Using R-trees: Breadth-First Traversal with Global Optimizations , 1997, VLDB.

[17]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[18]  Walid G. Aref,et al.  Query Indexing and Velocity Constrained Indexing: Scalable Techniques for Continuous Queries on Moving Objects , 2002, IEEE Trans. Computers.

[19]  Anastasios Kementsietsidis,et al.  Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data , 2001, SIGMOD 2011.

[20]  Kyuseok Shim,et al.  High-dimensional similarity joins , 1997, Proceedings 13th International Conference on Data Engineering.

[21]  Kenneth A. Ross,et al.  Making B+- trees cache conscious in main memory , 2000, SIGMOD '00.

[22]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[23]  Thomas S. Huang,et al.  Supporting content-based queries over images in MARS , 1997, Proceedings of IEEE International Conference on Multimedia Computing and Systems.

[24]  Hans-Peter Kriegel,et al.  Efficient processing of spatial joins using R-trees , 1993, SIGMOD Conference.

[25]  Kihong Kim,et al.  Optimizing multidimensional index trees for main memory access , 2001, SIGMOD '01.

[26]  Rakesh Agrawal,et al.  Parallel Algorithms for High-dimensional Similarity Joins for Data Mining Applications , 1997, Very Large Data Bases Conference.

[27]  Christian Böhm,et al.  Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data , 2001, SIGMOD '01.

[28]  Christos Faloutsos,et al.  Fast Nearest Neighbor Search in Medical Image Databases , 1996, VLDB.

[29]  Chun Zhang,et al.  Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.