A Heterogeneous High-Dimensional Approximate Nearest Neighbor Algorithm

We consider the problem of finding approximate nearest neighbors in high dimensions. Instead of the more general locality-sensitive hashing (LSH) formulation, we adopt an older-style probabilistic formulation and show that, at least for sparse problems, it admits much more efficient algorithms than LSH random projections, which destroy sparseness. Efficient algorithms for homogeneous problems (all coordinates have the same probability distribution) are well known; the most famous reference is the work by Broder in 1998. The main theme of this paper is to find the “best” generalization of these algorithms to heterogeneous problems (coordinates may have different probability distributions). We present a practical algorithm that is asymptotically best within a wide, natural class of algorithms. Readers interested in the more complicated algorithm that is the best known to date can consult our previous work from 2010. The analysis of our algorithm reveals that its complexity is governed by an information-like function, which we call “small leaves bucketing forest information.” Any doubt about whether it deserves the name “information” is dispelled by the aforementioned work.
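
For orientation, the homogeneous sparse setting mentioned above is commonly illustrated with min-wise hashing in the style of Broder's work. The following is a minimal sketch of that idea only, not the heterogeneous algorithm of this paper. It assumes each point is given as a non-empty set of its nonzero coordinate indices; the function names, the linear hash family, and the choice of prime are illustrative assumptions.

```python
import random

PRIME = 2_147_483_647  # assumed modulus; coordinate indices must be smaller than this


def make_hashes(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(nonzero_coords, hashes):
    """Signature of a sparse binary vector given as a non-empty set of indices.

    Only the nonzero coordinates are touched, so sparseness is preserved.
    """
    return tuple(min((a * x + b) % PRIME for x in nonzero_coords)
                 for a, b in hashes)


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature entries, an estimate of Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hashes(64)
    x = {3, 17, 42, 1000, 52341}
    y = {3, 17, 42, 999, 52341}
    print(estimated_similarity(minhash_signature(x, hashes),
                               minhash_signature(y, hashes)))
```

Points whose signatures agree on a chosen group of entries would be placed in the same bucket and compared directly. Every coordinate is treated identically here, which is exactly the homogeneity assumption that the heterogeneous problem removes.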

[1] Nicole Immorlica et al. Locality-sensitive hashing scheme based on p-stable distributions, 2004, SCG '04.

[2] Piotr Indyk et al. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality, 2012, Theory Comput.

[3] Alexandr Andoni et al. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions, 2006, 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[4] Moshe Dubiner et al. Bucketing Coding and Information Theory for the Statistical High-Dimensional Nearest-Neighbor Problem, 2008, IEEE Transactions on Information Theory.

[5] Udi Manber et al. Finding Similar Files in a Large File System, 1994, USENIX Winter.

[6] Piotr Indyk et al. Approximate nearest neighbors: towards removing the curse of dimensionality, 1998, STOC '98.

[7] Andrei Z. Broder et al. Identifying and Filtering Near-Duplicate Documents, 2000, CPM.

[8] Michael Ian Shamos et al. Closest-point problems, 1975, 16th Annual Symposium on Foundations of Computer Science (sfcs 1975).

[9] Geoffrey Zweig et al. The bit vector intersection problem, 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[10] Pavel Zezula et al. Similarity search in metric databases through hashing, 2001, MULTIMEDIA '01.