Inducing Language-Agnostic Multilingual Representations

Multilingual representations have the potential to make cross-lingual systems available to the vast majority of languages in the world. However, they currently require large pretraining corpora or assume access to typologically similar languages. In this work, we address these obstacles by removing language identity signals from multilingual embeddings. We examine three approaches: 1) re-aligning the vector spaces of target languages (all together) to a pivot source language; 2) removing language-specific means and variances, which also improves the discriminativeness of the embeddings as a by-product; and 3) normalizing input texts by removing morphological contractions and reordering sentences, thus yielding language-agnostic representations. We evaluate on XNLI and on reference-free MT evaluation of varying difficulty across 19 selected languages. Our experiments demonstrate the language-agnostic behavior of our multilingual representations, which show the potential of zero-shot cross-lingual transfer to distant and low-resource languages, and decrease the performance gap by 8.9 points (M-BERT) and 18.2 points (XLM-R) on average across all tasks and languages. We make our code and models available.
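
The second approach above (removing language-specific means and variances) amounts to standardizing each language's embeddings with its own statistics. Below is a minimal sketch of that idea, assuming access to pooled sentence embeddings from a multilingual encoder such as M-BERT or XLM-R; the function and variable names (`standardize_per_language`, `embeddings_by_lang`) are illustrative and not taken from the paper's released code.

```python
import numpy as np

def standardize_per_language(embeddings_by_lang, eps=1e-8):
    """Remove each language's mean and variance from its sentence embeddings.

    embeddings_by_lang: dict mapping a language code to an array of shape
    (n_sentences, dim) holding multilingual encoder outputs for that language.
    Returns a dict with the same keys, where each language's embeddings are
    centered and scaled using that language's own statistics.
    """
    standardized = {}
    for lang, X in embeddings_by_lang.items():
        mu = X.mean(axis=0, keepdims=True)          # language-specific mean
        sigma = X.std(axis=0, keepdims=True) + eps  # language-specific std
        standardized[lang] = (X - mu) / sigma
    return standardized

# Usage (toy data): after per-language standardization, sentences from
# different languages can be compared directly, e.g. with cosine similarity,
# for zero-shot cross-lingual transfer.
embs = {"en": np.random.randn(100, 768), "sw": np.random.randn(80, 768)}
embs_std = standardize_per_language(embs)
```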
