Bucketing Coding and Information Theory for the Statistical High-Dimensional Nearest-Neighbor Problem

The problem of finding high-dimensional approximate nearest neighbors is considered when the data is generated by some known probabilistic model. A large natural class of algorithms (bucketing codes) is investigated. Bucketing information is defined and is proven to bound the performance of all bucketing codes. The bucketing information bound is asymptotically attained by some randomly constructed bucketing codes. The example of <i>n</i> very long (length <i>d</i> → ∞) Bernoulli(1/2) sequences of bits is singled out. It is assumed that <i>n</i> − 2<i>m</i> sequences are completely independent, while the remaining 2<i>m</i> sequences are composed of <i>m</i> dependent pairs. The interdependence within each pair is that their bits agree with probability <i>p</i>, where 1/2 < <i>p</i> ≤ 1. It is well known how to find most pairs with high probability by performing order of <i>n</i><sup>log<sub>2</sub>(2/<i>p</i>)</sup> comparisons. It is shown that order of <i>n</i><sup>1/<i>p</i>+ε</sup> comparisons suffice, for any ε > 0. A specific 2-D inequality (proven in another paper) implies that the exponent 1/<i>p</i> cannot be lowered. Moreover, if one sequence out of each pair belongs to a known set of <i>n</i><sup>(2<i>p</i>−1)<sup>2</sup></sup> sequences, pairing can be done using order <i>n</i><sup>1+ε</sup> comparisons!
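To make the data model and the two exponents concrete, the following is a minimal illustrative sketch. It generates sequences according to the model described above and looks for the dependent pairs with a simple bit-sampling bucketing scheme in the spirit of locality-sensitive hashing. It is not the randomly constructed bucketing code of the paper and makes no claim to attain the 1/p exponent. All function names, parameters, and the agreement threshold `min_agree` are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict


def exponents(p):
    """Return (log2(2/p), 1/p): the two comparison exponents named in the abstract."""
    return math.log2(2.0 / p), 1.0 / p


def make_data(n, m, d, p, rng):
    """Build n Bernoulli(1/2) bit sequences of length d.

    Assumption of this sketch: sequences 2j and 2j+1 (j < m) form the
    dependent pairs; all other sequences are independent.
    """
    data = [[rng.getrandbits(1) for _ in range(d)] for _ in range(n)]
    for j in range(m):
        first = data[2 * j]
        # Each bit of the partner agrees with the first sequence with probability p.
        data[2 * j + 1] = [b if rng.random() < p else 1 - b for b in first]
    return data


def find_pairs(data, k, rounds, min_agree, rng):
    """Bucket by k randomly sampled coordinates, then compare within buckets.

    Returns the set of index pairs whose fraction of agreeing bits is at
    least min_agree, together with the number of full comparisons made.
    """
    d = len(data[0])
    found = set()
    comparisons = 0
    for _ in range(rounds):
        coords = rng.sample(range(d), k)
        buckets = defaultdict(list)
        for idx, seq in enumerate(data):
            buckets[tuple(seq[c] for c in coords)].append(idx)
        for members in buckets.values():
            for x in range(len(members)):
                for y in range(x + 1, len(members)):
                    i, j = members[x], members[y]
                    comparisons += 1
                    agree = sum(u == v for u, v in zip(data[i], data[j])) / d
                    if agree >= min_agree:
                        found.add((i, j))
    return found, comparisons


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_data(n=200, m=5, d=256, p=0.9, rng=rng)
    pairs, cost = find_pairs(data, k=8, rounds=40, min_agree=0.7, rng=rng)
    print(sorted(pairs), cost)
    print(exponents(0.75))
```

In this sketch, `k` is taken near log<sub>2</sub> <i>n</i> so that unrelated sequences rarely share a bucket, and `rounds` is chosen large enough that each dependent pair is likely to collide at least once. Both are tuning assumptions and are not values taken from the paper.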
