Searching in metric spaces

The problem of searching for the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, rather than the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without knowledge of one another. Because of this, the same ideas have been reinvented several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. These include a quantitative definition of the elusive concept of "intrinsic dimensionality." We also present a unified view of all the known proposals to organize metric spaces, so that they can be understood under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts that had gone unnoticed because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.
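To make the setting concrete, the following is a minimal sketch of a range query in a metric space that is not a vector space: strings under edit distance. It is an illustration only, not one of the indexing methods surveyed in the paper; the names `edit_distance` and `range_query` and the sample word list are assumptions introduced here. The query is answered by exhaustive scan, which is the baseline that any metric index tries to beat by computing fewer distances.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings.

    It is non-negative, symmetric, zero only for equal strings, and
    satisfies the triangle inequality, so it defines a metric.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def range_query(items: Sequence[T], query: T, radius: float,
                dist: Callable[[T, T], float]) -> List[T]:
    """Return every item within `radius` of `query` (exhaustive scan).

    Only the distance function is used; no coordinates are assumed.
    """
    return [x for x in items if dist(query, x) <= radius]


# Hypothetical sample data, chosen only for illustration.
words = ["metric", "matrix", "metrics", "vector", "meter"]
result = range_query(words, "metric", 2, edit_distance)
print(result)
```

The scan computes one distance per element of the set. Because the only operation required of the data is the distance function, the same `range_query` works unchanged for any other metric.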
