Locality sensitive hash functions based on concomitant rank order statistics

Locality sensitive hash (LSH) functions are invaluable tools for approximate near-neighbor problems in high-dimensional spaces. In this work, we focus on LSH schemes where the similarity metric is the cosine measure. Our contribution is a new class of locality sensitive hash functions for the cosine similarity measure based on the theory of concomitants, which arises in order statistics. Consider <i>n</i> i.i.d. sample pairs {(<i>X</i><sub>1</sub>, <i>Y</i><sub>1</sub>), (<i>X</i><sub>2</sub>, <i>Y</i><sub>2</sub>), …, (<i>X</i><sub><i>n</i></sub>, <i>Y</i><sub><i>n</i></sub>)} obtained from a bivariate distribution <i>f</i>(<i>X</i>, <i>Y</i>). Concomitant theory captures the relation between the order statistics of <i>X</i> and <i>Y</i> in the form of a rank distribution given by Prob(Rank(<i>Y</i><sub><i>i</i></sub>) = <i>j</i> | Rank(<i>X</i><sub><i>i</i></sub>) = <i>k</i>). We exploit properties of this rank distribution to develop a locality sensitive hash family with excellent collision rate properties for the cosine measure. The computational cost of the basic algorithm is high for large hash lengths. We introduce several approximations, based on the properties of concomitant order statistics and discrete transforms, that perform almost as well at significantly reduced computational cost. We demonstrate the practical applicability of our algorithms by using them to find similar images in an image repository.
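
To make the link between concomitant ranks and hashing concrete, the following is a minimal sketch and not necessarily the construction used in the paper. It assumes each hash projects a vector onto <i>n</i> random Gaussian directions and returns the index of the largest projection. For two vectors <i>u</i> and <i>v</i>, the projection pairs are i.i.d. bivariate normal with correlation equal to their cosine similarity, so a collision is the event that the top-ranked <i>X</i> has a top-ranked concomitant <i>Y</i>. The names `make_hash` and `collision_rate` and the argmax rule are illustrative assumptions.

```python
import random


def make_hash(dim, n, seed):
    """Return a hash function built from n random Gaussian directions.

    Assumption for illustration: the hash value is the index of the
    direction with the largest projection (a rank order statistic).
    """
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

    def h(v):
        # Project v onto every direction, then report which one ranks highest.
        projections = [sum(d_i * v_i for d_i, v_i in zip(d, v)) for d in directions]
        return max(range(n), key=projections.__getitem__)

    return h


def collision_rate(u, v, n, trials=1000):
    """Estimate Prob(h(u) == h(v)) over independently drawn hash functions."""
    dim = len(u)
    hits = 0
    for t in range(trials):
        h = make_hash(dim, n, seed=t)
        hits += h(u) == h(v)
    return hits / trials
```

Under these assumptions, the estimated collision rate rises with cosine similarity: vectors pointing in the same direction always collide, and vectors separated by a small angle collide more often than nearly orthogonal ones. Evaluating all <i>n</i> projections costs O(<i>n</i> · dim) per hash, which is the kind of cost that grows with hash length.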
