Why locality sensitive hashing works: A practical perspective

Abstract: Locality Sensitive Hashing (LSH) is one of the most efficient approaches to the nearest neighbor search problem in high-dimensional spaces. A family $\mathcal{H}$ of hash functions is called locality sensitive if the collision probability $p_h(r)$ of any two points $\langle q, p \rangle$ at distance $r$ under a random hash function $h$ decreases with $r$. The classic LSH algorithm employs a data structure consisting of $k \cdot l$ randomly chosen hash functions to achieve a more desirable collision curve, and the collision probability $P_{h^{k,l}}(r)$ for $\langle q, p \rangle$ is equal to $1 - (1 - p_h(r)^k)^l$. The great success of LSH is usually attributed to the solid theoretical guarantees for $P_{h^{k,l}}(r)$ and $p_h(r)$. In practice, however, users are more interested in the recall rate, i.e., the probability that a random query collides with its $r$-near neighbor over a fixed LSH data structure $h^{k,l}$. Implicitly or explicitly, $P_{h^{k,l}}(r)$ is often misinterpreted as the recall rate and used to predict the performance of LSH. This is problematic because $P_{h^{k,l}}(r)$ is actually the expectation of the recall rate over random data structures. Interestingly, numerous empirical studies show that, for most (if not all) real datasets and a fixed sample of a random LSH data structure, the recall rate is very close to $P_{h^{k,l}}(r)$. In this paper, we provide a theoretical justification for this phenomenon. We show that (1) for random datasets the recall rate is asymptotically equal to $P_{h^{k,l}}(r)$; and (2) for arbitrary datasets the variance of the recall rate is very small as long as the parameters $k$ and $l$ are properly chosen and the dataset is large enough. Our analysis (1) explains why the practical performance of LSH (the recall rate) matches the theoretical expectation $P_{h^{k,l}}(r)$ so well; and (2) indicates that, in addition to the theoretical guarantees, the mechanism by which LSH data structures are constructed and the sheer amount of data are also main causes of the success of LSH in practice.
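
To make the distinction between the expected collision curve and the recall of one fixed structure concrete, here is a minimal sketch. It assumes random-hyperplane (SimHash) hashing for angular distance, where $p_h(\theta) = 1 - \theta/\pi$; the values of $k$, $l$, $d$, the angles, and the sample sizes are illustrative choices of ours, not the paper's experimental setup.

```python
import numpy as np

# Assumed setup: random-hyperplane (SimHash) LSH for angular distance,
# where p_h(theta) = 1 - theta/pi. k, l, and d are illustrative choices.
rng = np.random.default_rng(0)
d, k, l = 64, 8, 16

def p_h(theta):
    """Per-function collision probability of random-hyperplane LSH."""
    return 1.0 - theta / np.pi

def P_kl(theta, k, l):
    """Theoretical collision curve 1 - (1 - p_h(theta)^k)^l, i.e. the
    expectation of the recall rate over random LSH data structures."""
    return 1.0 - (1.0 - p_h(theta) ** k) ** l

# One *fixed* sample of the LSH data structure h^{k,l}:
# l tables, each defined by k random hyperplanes.
tables = [rng.standard_normal((k, d)) for _ in range(l)]

def collide(q, p):
    """True if q and p fall into the same bucket in at least one table."""
    return any(np.array_equal(np.sign(H @ q), np.sign(H @ p)) for H in tables)

def make_pair(theta):
    """A random unit vector q and a point p at angle theta from q."""
    q = rng.standard_normal(d); q /= np.linalg.norm(q)
    v = rng.standard_normal(d); v -= (v @ q) * q; v /= np.linalg.norm(v)
    return q, np.cos(theta) * q + np.sin(theta) * v

# Empirical recall over random queries for the one fixed structure above,
# compared against the theoretical expectation P_kl.
for theta in (0.5, 1.0, 1.5):
    recall = np.mean([collide(*make_pair(theta)) for _ in range(2000)])
    print(f"theta={theta:.1f}  P_kl={P_kl(theta, k, l):.3f}  recall={recall:.3f}")
```

Rerunning with a different seed draws a different fixed structure $h^{k,l}$; how little the printed recall varies across seeds is an empirical view of the variance that the paper bounds.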
