A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models

Due to high annotation costs, making the best use of existing human-created training data is an important research direction. We therefore carry out a systematic evaluation of the transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves the reliability of our results. Furthermore, since the licenses of source datasets often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels, possibly with subsequent fine-tuning on a modest number of annotated queries, can produce a model that is competitive with or better than one obtained via transfer learning. Yet, the stability and/or effectiveness of few-shot training still needs improvement, since it can sometimes degrade the performance of a pretrained model.
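A minimal sketch of the pseudo-labeling setup follows, assuming the rank_bm25 and sentence-transformers packages; the function bm25_pseudo_labels, the toy data, and the positive/negative selection heuristic are illustrative assumptions, not the paper's actual pipeline. The idea is to use BM25 rankings to create noisy positive/negative query-document pairs and fine-tune a BERT cross-encoder re-ranker on them:

    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder, InputExample
    from torch.utils.data import DataLoader

    def bm25_pseudo_labels(queries, corpus, top_k=100, n_pos=1):
        # Index the corpus with BM25 (whitespace tokenization for brevity).
        bm25 = BM25Okapi([doc.split() for doc in corpus])
        examples = []
        for q in queries:
            scores = bm25.get_scores(q.split())
            ranked = sorted(range(len(corpus)),
                            key=scores.__getitem__, reverse=True)[:top_k]
            # Treat the highest-scoring document(s) as pseudo-positives ...
            for i in ranked[:n_pos]:
                examples.append(InputExample(texts=[q, corpus[i]], label=1.0))
            # ... and the remaining top-k documents as pseudo-negatives.
            for i in ranked[n_pos:]:
                examples.append(InputExample(texts=[q, corpus[i]], label=0.0))
        return examples

    # Toy data for illustration only.
    queries = ["what is bm25"]
    corpus = ["BM25 is a bag-of-words ranking function.",
              "BERT is a pretrained transformer encoder.",
              "Pooling affects test collection bias."]
    train_examples = bm25_pseudo_labels(queries, corpus, top_k=3)

    # Fine-tune a BERT re-ranker on the pseudo-labeled pairs; a small set of
    # annotated queries could subsequently be used for few-shot fine-tuning.
    model = CrossEncoder("bert-base-uncased", num_labels=1)
    model.fit(train_dataloader=DataLoader(train_examples, shuffle=True,
                                          batch_size=16),
              epochs=1)

In a realistic setting, BM25 would of course be run over an inverted index of a large collection rather than an in-memory list, and one would sample negatives from deeper in the ranking rather than taking the entire top-k tail.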
