Characterizing the optimal pivots for efficient similarity searches in vector space databases with Minkowski distances

Abstract: Pivot-based retrieval algorithms are commonly used to solve similarity queries in a number of application domains, such as multimedia retrieval, biomedical databases, time series and computer vision. The query performance of pivot-based index algorithms can be significantly improved by choosing a set of pivots that narrows the database down to only those elements relevant to a query. While many other approaches in the literature rely on empirical studies or on intuitive observations and assumptions to obtain effective pivot strategies, this paper addresses the problem with a formal mathematical approach. Our study concludes that the optimal set of pivots in vector databases with L_p metrics is a set of uniformly distributed points on the surface of an n-sphere defined by these metrics. To make the study mathematically tractable, the data in the database are assumed to be uniformly distributed, which allows us to treat the problem from a purely geometrical point of view. We then present experimental results demonstrating the usefulness of our characterization when applied to real databases in the (R^n, L_p) metric space. Our approach is shown to outperform comparable techniques in the literature. We do not, however, propose a new pivot-selection technique; the experiments are designed exclusively to show the usefulness of the characterization.
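For readers unfamiliar with pivot-based filtering, the following is a minimal sketch of how a pivot table can discard database elements in a range query under an L_p distance. It is not the method evaluated in the paper; the function names (`minkowski`, `build_pivot_table`, `range_query`) and the small numerical tolerance are assumptions made for illustration, and the pivots are simply supplied by the caller rather than selected by any particular strategy.

```python
def minkowski(x, y, p):
    """L_p (Minkowski) distance between two equal-length vectors, for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def build_pivot_table(data, pivots, p):
    """Precompute the distance from every database element to every pivot."""
    return [[minkowski(x, piv, p) for piv in pivots] for x in data]


def range_query(data, pivots, table, query, radius, p, tol=1e-12):
    """Return (indices of elements within `radius` of `query`,
    number of database elements whose true distance was computed)."""
    q_to_piv = [minkowski(query, piv, p) for piv in pivots]
    hits = []
    evaluated = 0
    for i, x in enumerate(data):
        # Triangle inequality: |d(q, piv) - d(x, piv)| <= d(q, x) for any pivot,
        # so the largest such gap is a lower bound on the true distance.
        lower = max(abs(dq - dx) for dq, dx in zip(q_to_piv, table[i]))
        if lower > radius + tol:
            continue  # x cannot lie within the query radius; skip it
        evaluated += 1
        if minkowski(query, x, p) <= radius:
            hits.append(i)
    return hits, evaluated
```

The fewer elements survive the lower-bound test, the fewer true distances have to be computed, which is why the placement of the pivots matters for query cost.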
