Identifying Necessary Elements for BERT's Multilinguality

It has been shown that multilingual BERT (mBERT) yields high-quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any kind of crosslingual signal during training. While recent literature has studied this effect, the exact reason for mBERT's multilinguality is still unknown. We aim to identify architectural properties of BERT as well as linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experimentation, we propose an efficient setup with small BERT models and synthetic as well as natural data. Overall, we identify six elements that are potentially necessary for BERT to be multilingual. Architectural factors that contribute to multilinguality are underparameterization, shared special tokens (e.g., "[CLS]"), shared position embeddings, and replacing masked tokens with random tokens. Factors related to training data that are beneficial for multilinguality are similar word order and comparability of corpora.
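As background to the "replacing masked tokens with random tokens" factor, the sketch below shows BERT's standard 80/10/10 masked-language-modeling scheme (Devlin et al., 2019), in which a small fraction of the masked positions is filled with random vocabulary items rather than "[MASK]". The function name and the 15% masking rate are standard BERT defaults used for illustration, not details taken from this paper's experimental setup.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Standard BERT MLM masking: of the selected positions, 80% become [MASK],
    10% are replaced with a random token, and 10% keep the original token."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return inputs, labels
```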
