Single versus Multiple Sorting in All Pairs Similarity Search

To save memory and improve speed, vectorial data such as images and signals are often represented as strings of discrete symbols (i.e., sketches). Charikar (2002) proposed a fast approximate method for finding neighbor pairs of strings by sorting them and scanning the sorted list with a small window. This method, which we shall call “single sorting”, is applied to locality sensitive codes and is widely used in speed-demanding web-related applications. To improve on single sorting, we propose a novel method that employs blockwise masked sorting. Our method can dramatically reduce the number of candidate pairs that have to be verified by distance calculation, in exchange for an increased number of sorting operations. It is therefore especially attractive for high-dimensional dense data, where distance calculation is expensive. Empirical results show the efficiency of our method in comparison to single sorting and recent fast nearest neighbor methods.
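The two strategies contrasted above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`single_sorting`, `blockwise_masked_sorting`), the parameters (`window`, `num_blocks`, `max_dist`), and the use of Hamming distance on equal-length strings as the verification step are all assumptions made for illustration. The blockwise variant relies on a pigeonhole argument: if two strings differ in at most `max_dist` positions and are split into `num_blocks > max_dist` blocks, they agree exactly on at least `num_blocks - max_dist` blocks, so sorting on each such block subset brings every true neighbor pair into the same run of equal keys.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def single_sorting(codes, window, max_dist):
    """Approximate: sort once, compare each string with the next `window` strings."""
    order = sorted(range(len(codes)), key=lambda i: codes[i])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            if hamming(codes[i], codes[j]) <= max_dist:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def blockwise_masked_sorting(codes, num_blocks, max_dist):
    """Sort once per subset of kept blocks; verify pairs whose kept blocks are identical.

    Assumes num_blocks > max_dist and that all codes have the same length.
    """
    length = len(codes[0])
    bounds = [length * b // num_blocks for b in range(num_blocks + 1)]
    pairs = set()
    # Mask out max_dist blocks, i.e. keep every subset of num_blocks - max_dist blocks.
    for kept in combinations(range(num_blocks), num_blocks - max_dist):

        def key(i):
            return tuple(codes[i][bounds[b] : bounds[b + 1]] for b in kept)

        order = sorted(range(len(codes)), key=key)
        start = 0
        for pos in range(1, len(order) + 1):
            # A run of identical masked keys ends here; verify all pairs inside it.
            if pos == len(order) or key(order[pos]) != key(order[start]):
                for i, j in combinations(order[start:pos], 2):
                    if hamming(codes[i], codes[j]) <= max_dist:
                        pairs.add((min(i, j), max(i, j)))
                start = pos
    return pairs


if __name__ == "__main__":
    sketches = ["0000", "0001", "1111", "1110", "0110"]
    print(single_sorting(sketches, window=1, max_dist=1))
    print(blockwise_masked_sorting(sketches, num_blocks=2, max_dist=1))
```

In this sketch a set removes pairs that are found under more than one block subset; a production implementation would likely avoid generating such duplicates in the first place. The number of sorts grows with the number of block subsets, which is the extra sorting cost the abstract trades for fewer distance calculations.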

[1]  Geoffrey E. Hinton,et al.  Semantic hashing , 2009, Int. J. Approx. Reason..

[2]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[3]  Takeaki Uno Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data , 2009, Knowledge and Information Systems.

[4]  Trevor Darrell,et al.  Learning to Hash with Binary Reconstructive Embeddings , 2009, NIPS.

[5]  William B. March,et al.  Linear-time Algorithms for Pairwise Statistical Problems , 2009, NIPS.

[6]  Antonio Torralba,et al.  80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Jason Weston,et al.  Protein ranking: from local to global structure in the protein similarity network. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Cordelia Schmid,et al.  Evaluation of GIST descriptors for web-scale image search , 2009, CIVR '09.

[9]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[10]  Prateek Jain,et al.  Fast Similarity Search for Learned Metrics , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Karl R. Abrahamson Generalized String Matching , 1987, SIAM J. Comput..

[12]  Yousef Saad,et al.  Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection , 2009, J. Mach. Learn. Res..

[13]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[14]  S. Muthukrishnan,et al.  Approximate nearest neighbors and sequence comparison with block operations , 2000, STOC '00.

[15]  Svetlana Lazebnik,et al.  Locality-sensitive binary codes from shift-invariant kernels , 2009, NIPS.

[16]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[17]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[18]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[19]  David P. Williamson,et al.  Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming , 1995, JACM.

[20]  J. Tenenbaum,et al.  A global geometric framework for nonlinear dimensionality reduction. , 2000, Science.

[21]  Patrick Pantel,et al.  Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering , 2005, ACL.

[22]  Kenneth Ward Church,et al.  Very sparse random projections , 2006, KDD '06.

[23]  Abhinandan Das,et al.  Google news personalization: scalable online collaborative filtering , 2007, WWW '07.

[24]  Ulrike von Luxburg,et al.  Graph Laplacians and their Convergence on Random Neighborhood Graphs , 2006, J. Mach. Learn. Res..

[25]  Antonio Torralba,et al.  Unsupervised Detection of Regions of Interest Using Iterative Link Analysis , 2009, NIPS.

[26]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[27]  Kristen Grauman,et al.  Kernelized locality-sensitive hashing for scalable image search , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[28]  John Langford,et al.  Cover trees for nearest neighbor , 2006, ICML.