Billion-Scale Similarity Search with GPUs

Similarity search finds application in database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data parallel tasks such as distance computation, prior approaches in this domain are bottlenecked by algorithms that expose less parallelism, such as $k$-min selection, or make poor use of the memory hierarchy. We propose a novel design for $k$-selection. We apply it in different similarity search scenarios, by optimizing brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation operates at up to 55 percent of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5× faster than prior GPU state of the art. It enables the construction of a high accuracy $k$-NN graph on 95 million images from the YFCC100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.
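
To make the two stages named above concrete, the following is a minimal CPU sketch of brute-force nearest neighbor search: a distance computation over every database vector, followed by $k$-selection of the smallest distances. It is an illustration only and not the GPU design proposed in the paper; the function names `squared_l2` and `brute_force_knn` are hypothetical, and the standard-library `heapq.nsmallest` stands in for the $k$-selection step.

```python
import heapq
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(
    query: Sequence[float],
    database: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[float, int]]:
    """Return the k (distance, index) pairs closest to `query`, nearest first.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Distance computation: each database vector is handled independently,
    # which is the data-parallel part of the search.
    distances = (
        (squared_l2(query, vec), idx) for idx, vec in enumerate(database)
    )
    # k-selection: keep only the k smallest distances without sorting
    # the full list of candidates.
    return heapq.nsmallest(k, distances)


if __name__ == "__main__":
    database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
    print(brute_force_knn([0.0, 0.0], database, k=2))
```

On a CPU the selection step is cheap relative to the distance computation; the abstract's point is that on a GPU the selection step exposes less parallelism and can become the bottleneck instead.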
