Cross-Lingual Training with Dense Retrieval for Document Retrieval

Dense retrieval has shown great success in passage ranking for English. However, its effectiveness in document retrieval for non-English languages remains unexplored due to limited training resources. In this work, we investigate techniques for transferring document ranking models trained on English annotations to multiple non-English languages. Experiments on test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families reveal that zero-shot model-based transfer using mBERT improves search quality in non-English monolingual retrieval. We also find that weakly supervised target-language transfer yields performance competitive with generation-based target-language transfer, which requires external translators and query generators.
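To make the zero-shot model-based transfer concrete, the sketch below shows a minimal DPR-style bi-encoder built on mBERT: queries and documents are encoded independently and ranked by dot-product similarity, so a retriever trained on English relevance labels can be applied directly to a non-English collection. The checkpoint name, [CLS] pooling, truncation length, and the toy French query/documents are illustrative assumptions, not the exact setup described in the abstract.

```python
# Minimal sketch of zero-shot cross-lingual dense retrieval with an mBERT
# bi-encoder. Illustrative only: model choice, pooling, and scoring are
# assumptions, not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-multilingual-cased"  # assumed mBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

def encode(texts):
    """Encode texts into dense vectors using the [CLS] representation."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # (batch, hidden) [CLS] embeddings

# An English-trained retriever applied zero-shot to a non-English collection:
query = "effets du changement climatique sur l'agriculture"  # French query
docs = ["Le changement climatique réduit les rendements agricoles ...",
        "Histoire de la peinture impressionniste ..."]

q_emb = encode([query])           # (1, hidden)
d_emb = encode(docs)              # (num_docs, hidden)
scores = q_emb @ d_emb.T          # dot-product relevance scores
ranking = scores.squeeze(0).argsort(descending=True)
print(ranking.tolist())           # document indices, most relevant first
```

In a full system the document encoder would be fine-tuned on English query-passage pairs (e.g., with in-batch negatives) before being applied zero-shot to the target languages; the scoring and indexing side stays unchanged.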
