Learning Sublinear-Time Indexing for Nearest Neighbor Search

Most of the efficient sublinear-time indexing algorithms for the high-dimensional nearest neighbor search problem (NNS) are based on space partitions of the ambient space $\mathbb{R}^d$. Inspired by recent theoretical work on NNS for general metric spaces [Andoni, Naor, Nikolov, Razenshteyn, Waingarten STOC 2018, FOCS 2018], we develop a new framework for constructing such partitions that reduces the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner [Sanders, Schulz SEA 2013] and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS, our experiments show that the partitions found by Neural LSH consistently outperform partitions found by quantization- and tree-based methods.
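The reduction described above (build a k-nearest-neighbor graph over the dataset, partition it into balanced bins, then train a classifier to route queries to bins) can be illustrated with a toy sketch. This is not the authors' implementation. It makes several simplifying assumptions: a greedy pairwise-swap heuristic stands in for KaHIP, a perceptron stands in for the neural network, only two bins are used, and each query probes a single bin. All function names are hypothetical.

```python
import math

def knn_graph(points, k):
    """Undirected k-NN graph as a set of edges (i, j), i < j (brute force)."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def cut_size(edges, part):
    """Number of graph edges whose endpoints lie in different bins."""
    return sum(part[i] != part[j] for i, j in edges)

def balanced_bisection(n, edges):
    """Toy stand-in for a real partitioner: greedy swaps that keep bins equal."""
    part = [i % 2 for i in range(n)]  # balanced but arbitrary start
    while True:
        best, best_swap = cut_size(edges, part), None
        for u in range(n):
            for v in range(n):
                if part[u] == 0 and part[v] == 1:
                    part[u], part[v] = 1, 0   # try the swap
                    c = cut_size(edges, part)
                    part[u], part[v] = 0, 1   # undo it
                    if c < best:
                        best, best_swap = c, (u, v)
        if best_swap is None:
            return part                        # no swap reduces the cut
        u, v = best_swap
        part[u], part[v] = 1, 0

def train_perceptron(points, part, max_epochs=100):
    """Learn a linear map from a point to its bin (stand-in for a neural net)."""
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(points, part):
            y = 1 if label == 1 else -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                 # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def query(q, points, part, w, b):
    """Route q to the predicted bin, then scan only that bin."""
    score = sum(wi * qi for wi, qi in zip(w, q)) + b
    target = 1 if score > 0 else 0
    candidates = [i for i, p in enumerate(part) if p == target]
    return min(candidates, key=lambda i: math.dist(q, points[i]))

# Two well-separated clusters of four points each (illustrative data).
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11)]
edges = knn_graph(points, k=3)
part = balanced_bisection(len(points), edges)
w, b = train_perceptron(points, part)
nearest = query((0.4, 0.2), points, part, w, b)
```

On this data the partition separates the two clusters, and a query is answered by scanning only the four points in its predicted bin rather than all eight. A practical system would use many bins, a stronger partitioner and classifier, and would typically probe several of the highest-scoring bins per query.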

[1]  Ludwig Schmidt,et al.  Learning Representations for Faster Similarity Search , 2018 .

[2]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[3]  Jeff Johnson,et al.  Billion-Scale Similarity Search with GPUs , 2017, IEEE Transactions on Big Data.

[4]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[6]  Alexandr Andoni,et al.  Hölder Homeomorphisms and Approximate Nearest Neighbors , 2018, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS).

[7]  Alexandr Andoni,et al.  Practical and Optimal LSH for Angular Distance , 2015, NIPS.

[8]  Alexandr Andoni,et al.  Approximate Nearest Neighbor Search in High Dimensions , 2018, Proceedings of the International Congress of Mathematicians (ICM 2018).

[9]  Peter Sanders,et al.  Think Locally, Act Globally: Highly Balanced Graph Partitioning , 2013, SEA.

[10]  Patrick Pérez,et al.  SuBiC: A Supervised, Structured Binary Code for Image Search , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Xuemin Lin,et al.  SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index , 2014, Proc. VLDB Endow.

[12]  Tim Kraska,et al.  The Case for Learned Index Structures , 2018 .

[13]  Robert F. Sproull,et al.  Refinements to nearest-neighbor searching in k-dimensional trees , 1991, Algorithmica.

[14]  Sanjoy Dasgupta,et al.  A neural algorithm for a fundamental computing problem , 2017 .

[15]  Martin Aumüller,et al.  ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms , 2018, SISAP.

[16]  Alexandr Andoni,et al.  Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors , 2016, SODA.

[17]  Sanjoy Dasgupta,et al.  Randomized partition trees for exact nearest neighbor search , 2013, COLT.

[18]  Maria-Florina Balcan,et al.  Learning to Branch , 2018, ICML.

[19]  Cordelia Schmid,et al.  Spreading vectors for similarity search , 2018, ICLR.

[20]  Jiwen Lu,et al.  Deep hashing for compact binary codes learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Victor Lempitsky,et al.  The inverted multi-index , 2012, CVPR.

[22]  Sanjiv Kumar,et al.  Multiscale Quantization for Fast Similarity Search , 2017, NIPS.

[23]  Matt J. Kusner,et al.  From Word Embeddings To Document Distances , 2015, ICML.

[24]  Alexandr Andoni,et al.  Spectral Approaches to Nearest Neighbor Search , 2014, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[25]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[26]  Heng Tao Shen,et al.  Hashing for Similarity Search: A Survey , 2014, ArXiv.

[27]  Lior Wolf,et al.  In Defense of Product Quantization , 2017, ArXiv.

[28]  Shree K. Nayar,et al.  What Is a Good Nearest Neighbors Algorithm for Finding Similar Patches in Images? , 2008, ECCV.

[29]  Yury A. Malkov,et al.  Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Wei Liu,et al.  Learning to Hash for Indexing Big Data—A Survey , 2015, Proceedings of the IEEE.

[31]  Qin Zhang,et al.  EmbedJoin: Efficient Edit Similarity Joins via Embeddings , 2017, KDD.

[32]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[33]  Alexandr Andoni,et al.  Data-dependent hashing via nonlinear spectral gaps , 2018, STOC.

[34]  Kaushik Sinha,et al.  Improved nearest neighbor search using auxiliary information and priority functions , 2018, ICML.

[35]  Sergei Vassilvitskii,et al.  Competitive caching with machine learned advice , 2018, ICML.

[36]  Jian Sun,et al.  Optimized Product Quantization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Michael Mitzenmacher,et al.  A Model for Learned Bloom Filters and Optimizing by Sandwiching , 2018, NeurIPS.

[38]  David J. Fleet,et al.  Cartesian K-Means , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.