Paper Information - Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
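To make the contrast in the abstract concrete, here is a minimal sketch of how an exact brute-force k-NN search with an association-aware similarity can surface a candidate that exact term matching misses. It is not the authors' implementation: the association table `ASSOC`, its weights, the toy documents, and every function name are hypothetical and chosen only for illustration.

```python
import heapq

# Hypothetical term-association weights (illustrative values, not from the paper).
ASSOC = {
    ("car", "automobile"): 0.9,
    ("repair", "maintenance"): 0.7,
}


def term_sim(q, d):
    """Similarity of two terms: 1.0 if identical, otherwise a table lookup."""
    if q == d:
        return 1.0
    return ASSOC.get((q, d), ASSOC.get((d, q), 0.0))


def similarity(query, doc):
    """Sum, over query terms, of the best association with any document term."""
    return sum(max((term_sim(q, d) for d in doc), default=0.0) for q in query)


def term_based_candidates(query, docs):
    """Baseline: keep only documents that share at least one exact query term."""
    query_terms = set(query)
    return [i for i, doc in enumerate(docs) if query_terms & set(doc)]


def brute_force_knn(query, docs, k):
    """Exact k-NN: score every document and return the k most similar indices."""
    # Negating the index makes ties resolve in favor of the earlier document.
    scored = ((similarity(query, doc), -i) for i, doc in enumerate(docs))
    return [-neg_i for _, neg_i in heapq.nlargest(k, scored)]


docs = [
    ["automobile", "maintenance", "tips"],  # relevant, but shares no query term
    ["car", "rental", "prices"],            # shares the term "car"
    ["bread", "baking", "recipe"],          # unrelated
]
query = ["car", "repair"]

print(term_based_candidates(query, docs))  # [1]
print(brute_force_knn(query, docs, k=2))   # [0, 1]
```

The exhaustive loop in `brute_force_knn` scores every document, which is the slow step the abstract refers to. An approximate k-NN index would replace that loop and return most, though not necessarily all, of the same neighbors.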
