OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries

State-of-the-art algorithms for Approximate Nearest Neighbor Search (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data-dependent indices that offer substantially better accuracy and search efficiency over data-agnostic indices by overfitting to the index data distribution. When the query data is drawn from a different distribution (e.g., when index represents image embeddings and query represents textual embeddings), such algorithms lose much of this performance advantage. On a variety of datasets, for a fixed recall target, latency is worse by an order of magnitude or more for Out-Of-Distribution (OOD) queries as compared to In-Distribution (ID) queries. The question we address in this work is whether ANNS algorithms can be made efficient for OOD queries if the index construction is given access to a small sample set of these queries. We answer positively by presenting OOD-DiskANN, which uses a sparing sample (1% of index set size) of OOD queries, and provides up to 40% improvement in mean query latency over SoTA algorithms of a similar memory footprint. OOD-DiskANN is scalable and has the efficiency of graph-based ANNS indices. Some of our contributions can improve query efficiency for ID queries as well.
