Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

Retrieval pipelines commonly rely on term-based search to obtain candidate records, which are subsequently re-ranked. This approach misses some candidates, e.g., because of a vocabulary mismatch between the query and the relevant records. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, whose similarity function can take subtle term associations into account. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be both more effective and more efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
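
To make the idea of candidate generation by k-NN search concrete, the following is a minimal sketch of the exact brute-force variant, which scores every record against the query. It is an illustration only, not the similarity function or the implementation used in this work: the `ASSOCIATIONS` table, the `similarity` function, and `brute_force_knn` are hypothetical names with made-up values, chosen to show how a similarity that credits associated terms can retrieve a record sharing no terms with the query.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical term-association scores (assumed values, for illustration only):
# how strongly a query term is associated with a different document term.
ASSOCIATIONS: Dict[Tuple[str, str], float] = {
    ("car", "automobile"): 0.8,
    ("fix", "repair"): 0.7,
}


def similarity(query: Sequence[str], doc: Sequence[str]) -> float:
    """Toy similarity: each query term contributes its best match in the document.

    An exact match scores 1.0; an associated term scores its table value;
    anything else scores 0.0.
    """
    total = 0.0
    for q in query:
        best = 0.0
        for d in doc:
            score = 1.0 if q == d else ASSOCIATIONS.get((q, d), 0.0)
            best = max(best, score)
        total += best
    return total


def brute_force_knn(
    query: Sequence[str],
    docs: Sequence[Sequence[str]],
    k: int,
    sim: Callable[[Sequence[str], Sequence[str]], float] = similarity,
) -> List[int]:
    """Return the indices of the k most similar documents, best first.

    Every document is scored, so the cost grows linearly with the collection.
    """
    scored = ((sim(query, doc), idx) for idx, doc in enumerate(docs))
    return [idx for _, idx in heapq.nlargest(k, scored)]


docs = [
    ["repair", "automobile", "engine"],  # shares no term with the query below
    ["fix", "bicycle"],
    ["cooking", "pasta"],
]
top = brute_force_knn(["fix", "car"], docs, k=2)
```

In this toy collection the first record ranks highest for the query `["fix", "car"]` even though it contains neither query term, which a purely term-based candidate generator would miss. An approximate k-NN index would aim to return most of the same neighbors without scoring every record.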
