Evaluating Interpolation and Extrapolation Performance of Neural Retrieval Models

A retrieval model should not only interpolate the training data but also extrapolate well to queries that differ from the training data. While neural retrieval models have demonstrated impressive performance on ad-hoc search benchmarks, we still know little about how they perform in terms of interpolation and extrapolation. In this paper, we demonstrate the importance of evaluating these two capabilities of neural retrieval models separately. First, we examine existing ad-hoc search benchmarks from both perspectives. We investigate the distributions of the training and test data and find considerable overlap in query entities, query intents, and relevance labels. This finding implies that evaluation on these test sets is biased toward interpolation and cannot accurately reflect extrapolation capacity. Second, we propose a novel evaluation protocol that separately evaluates interpolation and extrapolation performance on existing benchmark datasets. It resamples the training and test data based on query similarity and uses the resampled dataset for training and evaluation. Finally, we leverage the proposed evaluation protocol to comprehensively revisit a number of widely adopted neural retrieval models. The results show that models perform differently when moving from interpolation to extrapolation. For example, representation-based retrieval models perform almost as well as interaction-based retrieval models in terms of interpolation but not in terms of extrapolation. It is therefore necessary to evaluate interpolation and extrapolation performance separately, and the proposed resampling method serves as a simple yet effective evaluation tool for future IR studies.
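As a rough illustration of resampling by query similarity, the sketch below splits training queries according to how close they are to the test queries. It is not the protocol's actual procedure: the function name `resample_training_queries`, the use of cosine similarity over query embeddings, the maximum-similarity criterion, and the 0.8 threshold are all assumptions made for this example. Under these assumptions, training only on the dissimilar pool would probe extrapolation, while training on the similar pool would probe interpolation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def resample_training_queries(train, test, threshold=0.8):
    """Split training query ids by their maximum similarity to any test query.

    train, test: dicts mapping query id -> embedding (list of floats).
    threshold: hypothetical cutoff, chosen only for illustration.
    Returns (similar_ids, dissimilar_ids).
    """
    similar, dissimilar = [], []
    for qid, vec in train.items():
        max_sim = max((cosine(vec, t) for t in test.values()), default=0.0)
        if max_sim >= threshold:
            similar.append(qid)      # candidate pool for an interpolation setting
        else:
            dissimilar.append(qid)   # candidate pool for an extrapolation setting
    return similar, dissimilar


# Toy embeddings; real query embeddings would come from a text encoder.
train = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
test = {"q1": [1.0, 0.05]}
similar, dissimilar = resample_training_queries(train, test)
```

In this toy case, `t1` and `t2` point in nearly the same direction as the test query and land in the similar pool, while `t3` is nearly orthogonal to it and lands in the dissimilar pool.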
