nmT5 - Is parallel data still relevant for pre-training massively multilingual language models?

Recently, mT5, a massively multilingual version of T5, leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled-data regime.
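
To make the multi-task setup described above concrete, the sketch below shows how monolingual span corruption and a machine-translation objective can both be cast into a single text-to-text format and mixed during pre-training. It is a minimal illustration, not the authors' exact recipe: the function names, the simplified single-span corruption, and the `mt_rate` mixing parameter are assumptions introduced here for clarity.

```python
import random

# Minimal sketch of mixing two text-to-text pre-training objectives:
# (1) span corruption over monolingual text, (2) translation over parallel text.
# The sentinel token and single-span corruption are simplified stand-ins
# for the T5/mT5 preprocessing; they are not the paper's exact procedure.

def span_corruption_example(text, sentinel="<extra_id_0>"):
    """Mask one contiguous span and ask the model to reconstruct it."""
    tokens = text.split()
    if len(tokens) < 3:
        return {"inputs": text, "targets": text}
    start = random.randrange(0, len(tokens) - 1)
    end = min(len(tokens), start + random.randint(1, 3))
    inputs = " ".join(tokens[:start] + [sentinel] + tokens[end:])
    targets = f"{sentinel} " + " ".join(tokens[start:end])
    return {"inputs": inputs, "targets": targets}

def translation_example(src_text, tgt_text, src_lang, tgt_lang):
    """Cast a parallel sentence pair as a text-to-text task with a task prefix."""
    return {
        "inputs": f"translate {src_lang} to {tgt_lang}: {src_text}",
        "targets": tgt_text,
    }

def mixed_pretraining_stream(mono_sentences, parallel_pairs, mt_rate=0.2):
    """Sample from both objectives; mt_rate controls the share of parallel data."""
    while True:
        if parallel_pairs and random.random() < mt_rate:
            src, tgt, src_lang, tgt_lang = random.choice(parallel_pairs)
            yield translation_example(src, tgt, src_lang, tgt_lang)
        else:
            yield span_corruption_example(random.choice(mono_sentences))

if __name__ == "__main__":
    mono = ["the cat sat on the mat", "les chats aiment dormir au soleil"]
    para = [("the cat sat on the mat", "le chat était assis sur le tapis",
             "English", "French")]
    stream = mixed_pretraining_stream(mono, para, mt_rate=0.5)
    for _ in range(4):
        print(next(stream))
```

Because both objectives produce the same `{"inputs", "targets"}` structure, a single sequence-to-sequence model can consume the mixed stream directly; the mixing rate is the knob that determines how much parallel data the model sees during pre-training.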
