On cross-lingual retrieval with multilingual text encoders

Pretrained multilingual text encoders based on neural transformer architectures, such as multilingual BERT (mBERT) and XLM, have recently become a default paradigm for cross-lingual transfer of natural language processing models, rendering cross-lingual word embedding spaces (CLWEs) effectively obsolete. In this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR – a setup with no relevance judgments for IR-specific fine-tuning – pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs. For sentence-level retrieval, we do obtain state-of-the-art performance; the peak scores, however, are achieved by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather than by their vanilla ‘off-the-shelf’ variants. Following these results, we introduce localized relevance matching for document-level CLIR, where we independently score a query against document sections. In the second part, we evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., in a learning-to-rank setup) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments. Our results show that, despite the supervision, and due to the domain and language shift, supervised re-ranking rarely improves over multilingual transformers used as unsupervised base rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same domain, only language transfer) do we manage to improve the ranking quality. We uncover substantial empirical differences between cross-lingual retrieval results and results of (zero-shot) cross-lingual transfer for monolingual retrieval in target languages, which point to “monolingual overfitting” of retrieval models trained on monolingual (English) data, even if they are based on multilingual transformers.

Robert Litschko, Simone Paolo Ponzetto and Goran Glavaš
University of Mannheim
E-mail: {litschko,simone,goran}@informatik.uni-mannheim.de

Ivan Vulić
Language Technology Lab, University of Cambridge
E-mail: iv250@cam.ac.uk

arXiv:2112.11031v1 [cs.CL] 21 Dec 2021
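To make the unsupervised retrieval setup concrete, below is a minimal sketch of ad-hoc CLIR with an off-the-shelf multilingual sentence encoder: the query and the documents are embedded into a shared multilingual space and documents are ranked by cosine similarity to the query. The model name, the German query, and the toy corpus are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
# Minimal unsupervised CLIR sketch with an off-the-shelf multilingual encoder.
# Model name and corpus are placeholders for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "Klimawandel und erneuerbare Energien"  # German query
docs = [
    "Renewable energy sources help mitigate climate change.",
    "The stock market closed higher on Friday.",
]

# Encode query and documents into the shared multilingual space.
q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(q_emb, d_emb)[0]
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. ({scores[idx].item():.3f}) {docs[idx]}")
```

In this setup there is no IR-specific fine-tuning at all: ranking quality depends entirely on how well the pretrained encoder aligns the two languages in its embedding space.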

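The localized relevance matching idea can be sketched in the same style: rather than compressing a long document into a single vector, the document is split into (possibly overlapping) sections, the query is scored against each section independently, and the per-section scores are aggregated into a document score. The sliding-window splitting, the window size and stride, and max-pooling as the aggregation function are assumptions made for illustration; the abstract only specifies that query-section scores are computed independently.

```python
# Hedged sketch of localized relevance matching over document sections.
from sentence_transformers import SentenceTransformer, util

# Placeholder multilingual encoder; any shared-space sentence encoder works here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def localized_score(query: str, document: str, size: int = 100, stride: int = 50) -> float:
    """Score a query against overlapping word windows of a document and
    return the maximum window score (max-pooling aggregation, an assumption)."""
    tokens = document.split()
    # Guarantee at least one window, even for documents shorter than `size`.
    starts = range(0, max(len(tokens) - size, 0) + 1, stride)
    windows = [" ".join(tokens[s:s + size]) for s in starts]
    q_emb = model.encode(query, convert_to_tensor=True)
    w_emb = model.encode(windows, convert_to_tensor=True)
    return util.cos_sim(q_emb, w_emb).max().item()

# Rank a small toy collection with localized matching.
docs = {"d1": "long document text ...", "d2": "another document ..."}
ranking = sorted(docs, key=lambda d: localized_score("example query", docs[d]), reverse=True)
print(ranking)
```

This sidesteps the encoder's input-length limit and avoids diluting a locally relevant passage inside a single whole-document vector.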
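Finally, the in-domain contrastive fine-tuning mentioned in the abstract can be illustrated with a short sketch using in-batch negatives. The classic sentence-transformers training loop with MultipleNegativesRankingLoss is one common way to realize contrastive fine-tuning of a bi-encoder; the toy training pairs, model name, and hyperparameters below are placeholders rather than the paper's actual setup.

```python
# Hedged sketch of in-domain contrastive fine-tuning with in-batch negatives.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Placeholder (query, relevant document) pairs; in the paper's setting these
# would come from in-domain relevance data.
train_examples = [
    InputExample(texts=["what causes rain", "Rain forms when water vapour condenses ..."]),
    InputExample(texts=["treatment for flu", "Influenza is usually managed with rest ..."]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Each positive pair treats the other documents in the batch as negatives.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
```

Because the contrastive pairs come from the target retrieval domain, only the language is transferred at test time, which is the setting where the paper reports consistent ranking improvements.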