Discovering Representation Sprachbund For Multilingual Pre-Training

Multilingual pre-trained models have proven effective on many multilingual NLP tasks and enable zero-shot or few-shot transfer from high-resource languages to low-resource ones. However, because of significant typological differences and conflicts between some languages, such models often perform poorly on many languages and cross-lingual settings, which shows how difficult it is for a single model to handle a large number of diverse languages well at the same time. To alleviate this issue, we present a new multilingual pre-training pipeline. We propose to generate language representations from multilingual pre-trained models and conduct a linguistic analysis showing that language representation similarity reflects linguistic similarity from multiple perspectives, including language family, geographical sprachbund, lexicostatistics, and syntax. We then cluster all the target languages into multiple groups and call each group a representation sprachbund. Languages in the same representation sprachbund are expected to boost each other in both pre-training and fine-tuning, as they share rich linguistic similarity. We pre-train one multilingual model for each representation sprachbund. Experiments on cross-lingual benchmarks show significant improvements over strong baselines.
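
To make the pipeline concrete, the sketch below illustrates the clustering step: derive one vector per language from a multilingual pre-trained encoder, cluster the vectors, and treat each cluster as a representation sprachbund. The specific choices here (xlm-roberta-base as the encoder, mean-pooled hidden states as language representations, and k-means as the clustering algorithm) are illustrative assumptions for this sketch, not details confirmed by the abstract.

```python
# Minimal sketch (not the authors' released code) of the representation-sprachbund
# clustering step described above. Encoder, pooling, and clustering choices are
# assumptions made for illustration only.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def language_representation(sentences):
    """Mean-pool the encoder's last hidden states over a sample of sentences."""
    vectors = []
    with torch.no_grad():
        for text in sentences:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
            vectors.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.mean(vectors, axis=0)

def cluster_languages(corpora, num_sprachbunds=4):
    """Group languages into representation sprachbunds.

    `corpora` maps a language code to a small sample of monolingual sentences
    (hypothetical input; in practice this would come from web-crawled text).
    """
    codes = sorted(corpora)
    reps = np.stack([language_representation(corpora[c]) for c in codes])
    labels = KMeans(n_clusters=num_sprachbunds, random_state=0).fit_predict(reps)
    sprachbunds = {}
    for code, label in zip(codes, labels):
        sprachbunds.setdefault(label, []).append(code)
    return sprachbunds  # each value is one representation sprachbund
```

Under this sketch, one multilingual model would then be pre-trained on the combined corpora of each returned cluster, so that typologically similar languages share a model while conflicting ones are separated.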
