Cross-lingual Language Model Pretraining for Retrieval

Existing research on cross-lingual retrieval has not taken full advantage of large-scale pretrained language models such as multilingual BERT and XLM. We hypothesize that two key factors behind this gap are the absence of cross-lingual passage-level relevance data for finetuning and the lack of query-document style pretraining. In this paper, we introduce two novel retrieval-oriented pretraining tasks to further pretrain cross-lingual language models for downstream retrieval tasks such as cross-lingual ad-hoc retrieval (CLIR) and cross-lingual question answering (CLQA). We construct distant supervision data from multilingual Wikipedia using section alignment to support retrieval-oriented language model pretraining. We also propose to finetune language models directly on part of the evaluation collection by enabling the Transformer to accept longer input sequences. Experiments on multiple benchmark datasets show that our proposed model significantly improves upon general multilingual language models in both the cross-lingual retrieval setting and the cross-lingual transfer setting.
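
To illustrate the section-alignment idea, the following minimal Python sketch (our own illustration, not the authors' released pipeline) builds distant-supervision query-passage pairs from a pair of interlanguage-linked Wikipedia articles: the heading of an English section serves as the query, and the body of the aligned section in the other language serves as the relevant passage. The Article/Section data classes and the heading translation map are assumptions introduced only for this example.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Section:
    title: str   # section heading, e.g. "History"
    body: str    # section text

@dataclass
class Article:
    title: str               # article title in its own language
    sections: List[Section]

def align_sections(en_article: Article,
                   xx_article: Article,
                   heading_map: Dict[str, str]) -> List[Tuple[Section, Section]]:
    # Pair sections whose headings translate to each other.
    # heading_map maps English headings to target-language headings,
    # e.g. {"History": "Histoire"}; such a map could be mined from frequent
    # heading co-occurrences across interlanguage-linked articles (an assumption here).
    xx_by_title = {s.title: s for s in xx_article.sections}
    pairs = []
    for en_sec in en_article.sections:
        xx_title = heading_map.get(en_sec.title)
        if xx_title is not None and xx_title in xx_by_title:
            pairs.append((en_sec, xx_by_title[xx_title]))
    return pairs

def make_examples(en_article: Article,
                  xx_article: Article,
                  heading_map: Dict[str, str]) -> List[Dict[str, str]]:
    # Each aligned section pair yields one cross-lingual training example:
    # query = English article title plus section heading,
    # passage = aligned section body in the other language.
    return [
        {"query": f"{en_article.title} {en_sec.title}", "passage": xx_sec.body}
        for en_sec, xx_sec in align_sections(en_article, xx_article, heading_map)
    ]

# Usage example: one aligned pair between the English and French "Paris" articles.
en = Article("Paris", [Section("History", "The site has been inhabited since ...")])
fr = Article("Paris", [Section("Histoire", "Le site est habité depuis ...")])
print(make_examples(en, fr, {"History": "Histoire"}))
# -> [{'query': 'Paris History', 'passage': 'Le site est habité depuis ...'}]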
