Pivot Selection Strategies for Permutation-Based Similarity Search

Recently, permutation-based indexes have attracted interest in the area of similarity search. The basic idea of permutation-based indexes is that data objects are represented as appropriately generated permutations of a set of pivots, or reference objects. Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query. This, of course, assumes that similar objects are represented by similar permutations of the pivots. In the context of permutation-based indexing, most authors propose selecting pivots randomly from the data set, on the grounds that traditional pivot selection strategies do not yield better performance. However, to the best of our knowledge, no rigorous comparison has yet been performed. In this paper we compare five pivot selection strategies on three permutation-based similarity access methods. Among those, we propose a novel strategy specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested strategies. Second, no strategy is universally the best for all permutation-based access methods; rather, different strategies are optimal for different methods.
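
To make the basic idea concrete, the following is a minimal sketch of a permutation representation and a brute-force permutation-based search. It is an illustration under stated assumptions, not the access methods evaluated in the paper: the Euclidean distance, the Spearman footrule as the permutation similarity measure, the function names, and the linear scan are all choices made here for the example. Pivots are drawn at random from the data set, which is the baseline strategy the abstract refers to.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance (assumed here; any metric could be substituted)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def permutation(obj, pivots, dist=euclidean):
    """Return pivot indices ordered from closest to farthest from obj.

    Ties are broken by pivot index so the result is deterministic.
    """
    return sorted(range(len(pivots)), key=lambda i: (dist(obj, pivots[i]), i))


def spearman_footrule(perm_a, perm_b):
    """Sum of absolute differences between each pivot's rank in the two permutations."""
    rank_in_b = {pivot: rank for rank, pivot in enumerate(perm_b)}
    return sum(abs(rank - rank_in_b[pivot]) for rank, pivot in enumerate(perm_a))


def knn_by_permutation(query, data, pivots, k):
    """Rank data objects by permutation similarity to the query (linear scan).

    A real index would precompute and organize the permutations; this
    hypothetical helper recomputes them to keep the sketch short.
    """
    q_perm = permutation(query, pivots)
    scored = [
        (spearman_footrule(q_perm, permutation(obj, pivots)), idx)
        for idx, obj in enumerate(data)
    ]
    scored.sort()
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.random(), rng.random()) for _ in range(200)]
    pivots = rng.sample(data, 8)  # random pivot selection (the baseline strategy)
    print(knn_by_permutation((0.5, 0.5), data, pivots, k=5))
```

The candidates returned this way are approximate: two objects with similar permutations are only likely, not guaranteed, to be close in the original space, which is why the choice of pivots affects result quality.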
