CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP

Multi-lingual contextualized embeddings, such as multilingual BERT (mBERT), have shown success in a variety of zero-shot cross-lingual tasks. However, these models are limited by inconsistent contextualized representations of subwords across different languages. Existing work addresses this issue with bilingual projection and fine-tuning techniques. We propose a data augmentation framework that generates multi-lingual code-switching data for fine-tuning mBERT, encouraging the model to align representations of the source language and multiple target languages at once by mixing their contextual information. Compared with existing work, our method does not rely on bilingual sentences for training and requires only a single training process for multiple target languages. Experimental results on five tasks covering 19 languages show that our method significantly improves performance on all tasks compared with mBERT.
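
The core augmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes bilingual dictionaries mapping source-language words to target-language translations (e.g. from a resource such as MUSE), and the function name code_switch, the replacement ratio, and the toy dictionaries are illustrative choices.

    import random

    def code_switch(tokens, dictionaries, ratio=0.5):
        # For each token, with probability `ratio`, pick a random target
        # language and substitute a dictionary translation if one exists;
        # otherwise keep the original token. Mixing several target
        # languages within one sentence is what encourages the model to
        # align their representations in a single fine-tuning pass.
        augmented = []
        languages = list(dictionaries.keys())
        for token in tokens:
            if random.random() < ratio:
                lang = random.choice(languages)
                translations = dictionaries[lang].get(token.lower())
                if translations:
                    augmented.append(random.choice(translations))
                    continue
            augmented.append(token)
        return augmented

    # Toy example with two target languages:
    dictionaries = {
        "de": {"good": ["gut"], "morning": ["Morgen"]},
        "es": {"good": ["buen", "buena"], "morning": ["mañana"]},
    }
    print(code_switch("good morning everyone".split(), dictionaries))
    # Possible output: ['gut', 'mañana', 'everyone']

Because replacement happens independently per token and per sentence, repeated epochs expose the model to many different language mixtures of the same training example, without requiring any parallel sentences.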
