Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^{1/c^2 + o(1)}) and space O(dn + n^{1 + 1/c^2 + o(1)}). This almost matches the lower bound for hashing-based algorithms recently obtained in [27]. We also obtain a space-efficient version of the algorithm, which uses dn + n log^{O(1)} n space, with a query time of dn^{O(1/c^2)}. Finally, we discuss practical variants of the algorithms that utilize fast bounded-distance decoders for the Leech lattice.
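To give a feel for the stated bounds, the short sketch below evaluates only their leading terms, d n^{1/c^2} for query time and dn + n^{1 + 1/c^2} for space. It is an illustration under loud assumptions: the o(1) term in the exponent and all hidden constants are dropped, and the function names and the sample values of n, d, and c are hypothetical, not taken from the paper.

```python
from typing import Tuple


def lsh_exponent(c: float) -> float:
    """Leading exponent 1/c^2 from the stated bounds (o(1) term dropped)."""
    if c <= 1:
        raise ValueError("approximation factor c must exceed 1")
    return 1.0 / (c * c)


def leading_terms(n: int, d: int, c: float) -> Tuple[float, float]:
    """Return (query, space) leading terms, ignoring constants and o(1)."""
    rho = lsh_exponent(c)
    query = d * n ** rho            # d * n^(1/c^2)
    space = d * n + n ** (1 + rho)  # dn + n^(1 + 1/c^2)
    return query, space


if __name__ == "__main__":
    # Hypothetical data set size and dimension, chosen only for illustration.
    for c in (1.5, 2.0, 3.0):
        query, space = leading_terms(n=10**6, d=128, c=c)
        print(f"c={c}: exponent={lsh_exponent(c):.3f}, "
              f"query~{query:.3g}, space~{space:.3g}")
```

For instance, with c = 2 the exponent is 1/4, so the query-time leading term grows like d times the fourth root of n.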

[1]  J. Leech.  Notes on Sphere Packings , 1967, Canadian Journal of Mathematics.

[2]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[3]  N. J. A. Sloane,et al.  Soft decoding techniques for codes and lattices, including the Golay code and the Leech lattice , 1986, IEEE Trans. Inf. Theory.

[4]  N. J. A. Sloane,et al.  Sphere Packings, Lattices and Groups , 1987, Grundlehren der mathematischen Wissenschaften.

[5]  Sanguthevar Rajasekaran,et al.  The light bulb problem , 1995, COLT '89.

[6]  G. Pisier.  The volume of convex bodies and Banach space geometry , 1989.

[7]  Isidore Rigoutsos,et al.  FLASH: a fast look-up algorithm for string homology , 1993, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Alexander Vardy,et al.  Maximum likelihood decoding of the Leech lattice , 1993, IEEE Trans. Inf. Theory.

[9]  F. Frances Yao,et al.  Multi-index hashing for information retrieval , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[10]  David R. Karger,et al.  Approximate graph coloring by semidefinite programming , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[11]  Ofer Amrani,et al.  The Leech lattice and the Golay code: bounded-distance decoding and multilevel constructions , 1994, IEEE Trans. Inf. Theory.

[12]  Nathan Linial,et al.  The geometry of graphs and some of its algorithmic applications , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[13]  David P. Williamson,et al.  Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming , 1995, JACM.

[14]  Geoffrey Zweig,et al.  The bit vector intersection problem , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[15]  O. Amrani,et al.  Efficient bounded-distance decoding of the hexacode and associated decoders for the Leech lattice and the Golay code , 1994, Proceedings of 1994 IEEE International Symposium on Information Theory.

[16]  Alexander Vardy,et al.  Generalized minimum-distance decoding of Euclidean-space codes and lattices , 1996, IEEE Trans. Inf. Theory.

[17]  Gert Vegter,et al.  In handbook of discrete and computational geometry , 1997.

[18]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[19]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[20]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[21]  Sudipto Guha,et al.  Approximating a finite metric by a small number of tree metrics , 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280).

[22]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[23]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[24]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[25]  Alan M. Frieze,et al.  Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci.

[26]  Piotr Indyk,et al.  Scalable Techniques for Clustering the Web , 2000, WebDB.

[27]  R. Motwani,et al.  High-Dimensional Computational Geometry , 2000.

[28]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[29]  Jeremy Buhler,et al.  Efficient large-scale sequence comparison by locality-sensitive hashing , 2001, Bioinform..

[30]  Jeremy Buhler,et al.  Finding motifs using random projections , 2001, RECOMB.

[31]  Sariel Har-Peled.  A replacement for Voronoi diagrams of near linear size , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[32]  Uriel Feige,et al.  On the optimality of the random hyperplane rounding technique for MAX CUT , 2002, Random Struct. Algorithms.

[33]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[34]  Piotr Indyk,et al.  Better algorithms for high-dimensional proximity problems via asymmetric embeddings , 2003, SODA '03.

[35]  Robert Krauthgamer,et al.  Navigating nets: simple algorithms for proximity search , 2004, SODA '04.

[36]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[37]  Piotr Indyk,et al.  Nearest Neighbors in High-Dimensional Spaces , 2004, Handbook of Discrete and Computational Geometry, 2nd Ed..

[38]  Amit Chakrabarti,et al.  An optimal randomised cell probe lower bound for approximate nearest neighbour searching , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[39]  Patrick Pantel,et al.  Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering , 2005, ACL.

[40]  Alexandr Andoni,et al.  Efficient algorithms for substring near neighbor problem , 2006, SODA '06.

[41]  Alexandr Andoni,et al.  On the Optimality of the Dimensionality Reduction Method , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[42]  Ting Chen,et al.  Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing , 2006, J. Chem. Inf. Model..

[43]  Rina Panigrahy,et al.  Entropy based nearest neighbor search in high dimensions , 2005, SODA '06.

[44]  Rajeev Motwani,et al.  Lower bounds on locality sensitive hashing , 2005, SCG '06.

[45]  Bernard Chazelle,et al.  Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform , 2006, STOC '06.

[46]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[47]  Hanan Samet,et al.  Foundations of multidimensional and metric data structures , 2006, Morgan Kaufmann series in data management systems.

[48]  Tanaka Yuzuru,et al.  Spherical LSH for Approximate Nearest Neighbor Search on Unit Hypersphere , 2007.