Synthetic Cross-language Information Retrieval Training Data

A key stumbling block for neural cross-language information retrieval (CLIR) systems has been the paucity of training data. The appearance of the MS MARCO monolingual training set led to significant advances in the state of the art in neural monolingual retrieval, and machine translation of the MS MARCO documents into other languages has made this resource useful to the CLIR community as well. Yet such translation suffers from a number of problems: while MS MARCO is a large resource, it is of fixed size; its genre and domain of discourse are fixed; and the translated documents are not written as a native speaker of the target language would write them, but rather in translationese. To address these problems, we introduce the JH-POLO CLIR training set creation methodology. The approach begins by selecting a pair of non-English passages. A generative large language model is then used to produce an English query for which the first passage is relevant and the second passage is not. By repeating this process, collections of arbitrary size can be created in the style of MS MARCO, but using naturally occurring documents in any desired genre and domain of discourse. This paper describes the methodology in detail, shows its use in creating new CLIR training sets, and describes experiments using the newly created training data.
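
To make the generation loop concrete, the sketch below shows one way the pairing and query-generation step could be implemented. It is a minimal illustration only: the prompt wording, the passage source, and the generic `llm` callable are assumptions introduced here for clarity, not the prompts or model actually used in JH-POLO.

```python
import random
from typing import Callable, Dict, List

# Illustrative prompt: ask for an English query that separates the two passages.
PROMPT_TEMPLATE = (
    "Passage A (relevant): {pos}\n\n"
    "Passage B (not relevant): {neg}\n\n"
    "Write one English search query for which Passage A is relevant "
    "and Passage B is not relevant. Query:"
)

def make_training_example(passages: List[str],
                          llm: Callable[[str], str]) -> Dict[str, str]:
    """Select a pair of non-English passages and ask an LLM for an English query."""
    pos, neg = random.sample(passages, 2)  # pick a passage pair at random
    query = llm(PROMPT_TEMPLATE.format(pos=pos, neg=neg)).strip()
    # MS MARCO-style triple: (query, relevant passage, non-relevant passage)
    return {"query": query, "positive": pos, "negative": neg}

def build_collection(passages: List[str],
                     llm: Callable[[str], str],
                     n_examples: int) -> List[Dict[str, str]]:
    """Repeat the pairing step to build a training set of any desired size."""
    return [make_training_example(passages, llm) for _ in range(n_examples)]
```

Because the passages are naturally occurring documents in the target language, the resulting triples avoid translationese while still following the MS MARCO query-positive-negative format.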
