Index-driven similarity search in metric spaces (Survey Article)

Similarity search is an important operation in multimedia databases and other database applications involving complex objects. It consists of finding the objects in a data set S that are similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary "search hierarchy." These algorithms can be applied to each of the methods presented, provided a suitable search hierarchy is defined.
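As a rough illustration of what a query algorithm over a search hierarchy might look like, the sketch below reports objects in order of increasing distance from q using a priority queue. It is a minimal sketch under stated assumptions, not the article's own formulation: the `Node` class, the ball-shaped elements (a center plus a covering radius), and the helper names `leaf` and `incremental_nn` are all hypothetical. The only property it relies on is that d is a metric, so the triangle inequality gives d(q, o) >= d(q, center) - radius for every object o inside a ball.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    """Hypothetical hierarchy element: a ball covering everything below it."""
    center: object
    radius: float
    children: list = field(default_factory=list)  # child elements
    objects: list = field(default_factory=list)   # data objects stored here


def leaf(objects, d):
    """Build a leaf ball around the first object (an arbitrary choice)."""
    center = objects[0]
    radius = max(d(center, o) for o in objects)
    return Node(center, radius, objects=list(objects))


def incremental_nn(root, q, d):
    """Yield (distance, object) pairs in order of increasing distance from q.

    Elements are keyed by a lower bound on the distance from q to any object
    they contain; objects are keyed by their exact distance.
    """
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(0.0, next(tie), False, root)]
    while heap:
        key, _, is_object, item = heapq.heappop(heap)
        if is_object:
            yield key, item
            continue
        for child in item.children:
            # Triangle inequality: no object in the ball is closer than this.
            lower = max(d(q, child.center) - child.radius, 0.0)
            heapq.heappush(heap, (lower, next(tie), False, child))
        for obj in item.objects:
            heapq.heappush(heap, (d(q, obj), next(tie), True, obj))


def dist(a, b):
    """Toy metric on numbers, used only for the demonstration."""
    return abs(a - b)


# Two leaf balls under one root; the root's own ball is never tested.
root = Node(center=1, radius=13,
            children=[leaf([1, 2, 3], dist), leaf([10, 11, 14], dist)])
ranking = [obj for _, obj in incremental_nn(root, 9, dist)]
```

Because the function is a generator, a k-nearest-neighbor query amounts to taking the first k results, and a range query to stopping once the reported distance exceeds the query radius. Other index structures would plug in by supplying their own elements and lower-bound computation.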
