Solving similarity joins and range queries in metric spaces with the list of twin clusters

The metric space model abstracts many proximity or similarity problems. The primitives most frequently considered are range search and k-nearest neighbor search, which leaves out the similarity join, an extremely important primitive. Despite the great attention this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We solve two variants of the similarity join problem: (1) range joins: given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r; and (2) k-closest pair joins: find the k closest object pairs (one from each set). To this end, we devise a new metric index, called the List of Twin Clusters (LTC), which indexes both sets jointly, instead of following the natural approach of indexing one or both sets independently. Finally, we show how to use the LTC to solve classical range queries. Our results show significant speedups over the naive quadratic-time alternative for both join variants, and show that the LTC is competitive with the original list of clusters when solving range queries. Furthermore, we show that our technique has great potential for improvement.
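
To make the two join definitions concrete, the following is a minimal sketch of the naive quadratic-time baseline that the abstract compares against. It is not the LTC index itself. The function names `range_join` and `k_closest_pairs`, the parameter names, and the toy metric (absolute difference on numbers) are assumptions chosen for illustration.

```python
import heapq
from itertools import product
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Metric = Callable[[T, T], float]


def range_join(A: Sequence[T], B: Sequence[T], dist: Metric, r: float) -> List[Tuple[T, T]]:
    """Naive range join: every pair (a, b), a in A and b in B, with dist(a, b) <= r.

    Evaluates the metric |A| * |B| times.
    """
    return [(a, b) for a, b in product(A, B) if dist(a, b) <= r]


def k_closest_pairs(A: Sequence[T], B: Sequence[T], dist: Metric, k: int) -> List[Tuple[T, T, float]]:
    """Naive k-closest pair join: the k pairs (a, b) with the smallest distance.

    Also evaluates the metric |A| * |B| times; indices break ties so that
    the objects themselves never need to be comparable.
    """
    scored = ((dist(a, b), i, j) for i, a in enumerate(A) for j, b in enumerate(B))
    best = heapq.nsmallest(k, scored)
    return [(A[i], B[j], d) for d, i, j in best]


# Toy example: real numbers under the absolute-difference metric (an assumption).
A = [1.0, 5.0, 9.0]
B = [1.5, 6.0, 20.0]
metric = lambda x, y: abs(x - y)

range_result = range_join(A, B, metric, r=1.0)
closest_result = k_closest_pairs(A, B, metric, k=2)
```

Both functions treat the metric as a black box, which is the setting of a general metric database: any index that aims to beat this baseline has to avoid some of the |A| * |B| distance evaluations.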
