Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing

All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects that have similarity above a certain threshold for a given similarity measure-of-interest. When the number of points or dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index based approaches. BayesLSH is used for pruning the candidate space and computation of approximate similarity, whereas BayesLSHLite can only prune the candidates, but similarity needs to be computed exactly on the original data. Thus where ever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing too much on quality. However, the loss in quality is higher in the BayesLSH variant, where explicit data representation is not available, rather only a hash sketch is available and similarity has to be estimated approximately. In this work we revisit the LSH problem from a Frequentist setting and formulate sequential tests for composite hypothesis (similarity greater than or less than threshold) that can be leveraged by such LSH algorithms for adaptively pruning candidates aggressively. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea and two novel variants. We extend these variants to the case where approximate similarity needs to be computed using fixed-width sequential confidence interval generation technique. We compare these novel variants with the SPRT variant and BayesLSH/Bayes-LSHLite variants and show that they can provide tighter qualitative guarantees over BayesLSH/BayesLSHLite -- a state-of-the-art approach -- while being upto 2.1x faster than a traditional SPRT and 8.8x faster than AllPairs.

[1]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[2]  Frederick Mosteller,et al.  Unbiased Estimates for Certain Binomial Sampling Problems with Applications , 1946 .

[3]  J. Rice Mathematical Statistics and Data Analysis , 1988 .

[4]  Xiaojin Zhu,et al.  Introduction to Semi-Supervised Learning , 2009, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[5]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[6]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[7]  Patrick Pantel,et al.  Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering , 2005, ACL.

[8]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[9]  Monika Henzinger,et al.  Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.

[10]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[11]  Srinivasan Parthasarathy,et al.  Bayesian Locality Sensitive Hashing for Fast Similarity Search , 2011, Proc. VLDB Endow..

[12]  J. Andel Sequential Analysis , 2022, The SAGE Encyclopedia of Research Design.

[13]  L. R. Rasmussen,et al.  In information retrieval: data structures and algorithms , 1992 .

[14]  Krishna P. Gummadi,et al.  Measurement and analysis of online social networks , 2007, IMC '07.

[15]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[16]  Kristen Grauman,et al.  Kernelized Locality-Sensitive Hashing , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Jesse Frey,et al.  Fixed-Width Sequential Confidence Intervals for a Proportion , 2010 .

[18]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[19]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[20]  David Liben-Nowell,et al.  The link-prediction problem for social networks , 2007 .

[21]  Hosung Park,et al.  What is Twitter, a social network or a news media? , 2010, WWW '10.

[22]  Rizal Setya Perdana What is Twitter , 2013 .

[23]  Alan M. Frieze,et al.  Min-wise independent permutations (extended abstract) , 1998, STOC '98.