On cross-lingual retrieval with multilingual text encoders

Pretrained multilingual text encoders based on neural transformer architectures, such as multilingual BERT (mBERT) and XLM, have recently become a default paradigm for cross-lingual transfer of natural language processing models, rendering cross-lingual word embedding spaces (CLWEs) effectively obsolete. In this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR – a setup with no relevance judgments for IR-specific fine-tuning – pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs. For sentence-level retrieval, we do obtain state-of-the-art performance; the peak scores, however, are achieved by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather than by their vanilla ‘off-the-shelf’ variants. Following these results, we introduce localized relevance matching for document-level CLIR, where we independently score a query against document sections. In the second part, we evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., in a learning-to-rank setup) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments. Our results show that, despite the supervision, and due to the domain and language shift, supervised re-ranking rarely improves over multilingual transformers used as unsupervised base rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same domain, only language transfer) do we manage to improve the ranking quality. We uncover substantial empirical differences between cross-lingual retrieval results and results of (zero-shot) cross-lingual transfer for monolingual retrieval in target languages, which point to “monolingual overfitting” of retrieval models trained on monolingual (English) data, even if they are based on multilingual transformers.

Robert Litschko, Simone Paolo Ponzetto and Goran Glavaš
University of Mannheim
E-mail: {litschko,simone,goran}@informatik.uni-mannheim.de

Ivan Vulić
Language Technology Lab, University of Cambridge
E-mail: iv250@cam.ac.uk

arXiv:2112.11031v1 [cs.CL] 21 Dec 2021
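To make the unsupervised retrieval setup concrete, below is a minimal sketch of ad-hoc CLIR with an off-the-shelf multilingual sentence encoder: the query and the documents are embedded into a shared multilingual space and documents are ranked by cosine similarity to the query. The model name, the German query, and the toy corpus are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
# Minimal unsupervised CLIR sketch with an off-the-shelf multilingual encoder.
# Model name and corpus are placeholders for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "Klimawandel und erneuerbare Energien"  # German query
docs = [
    "Renewable energy sources help mitigate climate change.",
    "The stock market closed higher on Friday.",
]

# Encode query and documents into the shared multilingual space.
q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(q_emb, d_emb)[0]
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. ({scores[idx].item():.3f}) {docs[idx]}")
```

In this setup there is no IR-specific fine-tuning at all: ranking quality depends entirely on how well the pretrained encoder aligns the two languages in its embedding space.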

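The localized relevance matching idea can be sketched in the same style: rather than compressing a long document into a single vector, the document is split into (possibly overlapping) sections, the query is scored against each section independently, and the per-section scores are aggregated into a document score. The sliding-window splitting, the window size and stride, and max-pooling as the aggregation function are assumptions made for illustration; the abstract only specifies that query-section scores are computed independently.

```python
# Hedged sketch of localized relevance matching over document sections.
from sentence_transformers import SentenceTransformer, util

# Placeholder multilingual encoder; any shared-space sentence encoder works here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def localized_score(query: str, document: str, size: int = 100, stride: int = 50) -> float:
    """Score a query against overlapping word windows of a document and
    return the maximum window score (max-pooling aggregation, an assumption)."""
    tokens = document.split()
    # Guarantee at least one window, even for documents shorter than `size`.
    starts = range(0, max(len(tokens) - size, 0) + 1, stride)
    windows = [" ".join(tokens[s:s + size]) for s in starts]
    q_emb = model.encode(query, convert_to_tensor=True)
    w_emb = model.encode(windows, convert_to_tensor=True)
    return util.cos_sim(q_emb, w_emb).max().item()

# Rank a small toy collection with localized matching.
docs = {"d1": "long document text ...", "d2": "another document ..."}
ranking = sorted(docs, key=lambda d: localized_score("example query", docs[d]), reverse=True)
print(ranking)
```

This sidesteps the encoder's input-length limit and avoids diluting a locally relevant passage inside a single whole-document vector.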
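Finally, the in-domain contrastive fine-tuning mentioned in the abstract can be illustrated with a short sketch using in-batch negatives. The classic sentence-transformers training loop with MultipleNegativesRankingLoss is one common way to realize contrastive fine-tuning of a bi-encoder; the toy training pairs, model name, and hyperparameters below are placeholders rather than the paper's actual setup.

```python
# Hedged sketch of in-domain contrastive fine-tuning with in-batch negatives.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Placeholder (query, relevant document) pairs; in the paper's setting these
# would come from in-domain relevance data.
train_examples = [
    InputExample(texts=["what causes rain", "Rain forms when water vapour condenses ..."]),
    InputExample(texts=["treatment for flu", "Influenza is usually managed with rest ..."]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Each positive pair treats the other documents in the batch as negatives.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
```

Because the contrastive pairs come from the target retrieval domain, only the language is transferred at test time, which is the setting where the paper reports consistent ranking improvements.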