Lifting the Curse of Multilinguality by Pre-training Modular Transformers

Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X-Mod) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach makes it possible to add languages post-hoc with no measurable drop in performance, so that usage of the model is no longer limited to the set of pre-trained languages.
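
Below is a minimal sketch of the idea described above: each transformer layer keeps its attention and feed-forward weights shared across all languages, and routes its output through a bottleneck module owned by the current input language. The names (`XModLayer`, `LanguageModule`), dimensions, and the plain-PyTorch framing are illustrative assumptions for exposition only, not the paper's actual fairseq implementation.

```python
# Sketch of an X-Mod-style transformer layer with per-language modules.
# Names and hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class LanguageModule(nn.Module):
    """Adapter-style bottleneck feed-forward module owned by one language."""

    def __init__(self, d_model: int, d_bottleneck: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck transformation.
        return x + self.up(torch.relu(self.down(self.norm(x))))


class XModLayer(nn.Module):
    """Transformer encoder layer whose shared sublayers are followed by a
    language-specific module selected at run time."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int,
                 d_bottleneck: int, languages: list):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ff_norm = nn.LayerNorm(d_model)
        # One module per pre-training language; a new language can be added
        # post-hoc by registering an extra entry and training only its weights.
        self.lang_modules = nn.ModuleDict(
            {lang: LanguageModule(d_model, d_bottleneck) for lang in languages}
        )

    def forward(self, x: torch.Tensor, lang: str) -> torch.Tensor:
        # Shared self-attention and feed-forward sublayers (pre-norm).
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.ff_norm(x))
        # Route each batch through the module of its input language only, so
        # the number of trainable parameters per language stays constant.
        return self.lang_modules[lang](x)


# Example: an English batch only touches the shared weights and the "en" module.
layer = XModLayer(d_model=768, n_heads=12, d_ff=3072,
                  d_bottleneck=384, languages=["en", "de", "sw"])
out = layer(torch.randn(2, 16, 768), lang="en")
```

Under this framing, adding a language post-hoc amounts to registering one more entry in `lang_modules` (plus, in practice, new embeddings) and training only those parameters, while the shared weights and the other languages' modules stay frozen.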
