ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance on downstream cross-lingual tasks. This improvement stems from learning from large amounts of monolingual and parallel corpora. While it is generally acknowledged that parallel corpora are critical for improving model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representations of multiple languages with monolingual corpora, to break the constraint that the size of parallel corpora places on model performance. Our key insight is to integrate the idea of back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignment between different languages, which enhances the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks. The code and pre-trained models will be made publicly available.
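The abstract does not spell out the pair-generation procedure, so the following Python sketch only illustrates the general back-translation idea it describes: each monolingual sentence is paired with a model-generated translation, and the resulting pseudo-parallel pairs can then be fed to the same cross-lingual alignment objective that real parallel data would use. The `translate` callable and the training step named in the usage comment are hypothetical placeholders, not ERNIE-M's actual API.

```python
# Minimal sketch of back-translation-style pseudo-parallel pair generation.
# NOTE: `translate` is a hypothetical stand-in for whatever generation component
# the pre-trained cross-lingual model exposes; it is not ERNIE-M's real interface.

from typing import Callable, Iterable, List, Tuple


def build_pseudo_parallel_pairs(
    monolingual_sentences: Iterable[str],
    translate: Callable[[str, str], str],  # (sentence, target_lang) -> pseudo translation
    target_lang: str,
) -> List[Tuple[str, str]]:
    """Pair each monolingual sentence with a model-generated translation.

    The (source, pseudo-target) pairs stand in for parallel data when learning
    cross-lingual semantic alignment.
    """
    pairs: List[Tuple[str, str]] = []
    for sentence in monolingual_sentences:
        pseudo_target = translate(sentence, target_lang)
        pairs.append((sentence, pseudo_target))
    return pairs


# Illustrative usage (hypothetical corpus, model, and training step):
# pairs = build_pseudo_parallel_pairs(wiki_zh_sentences, my_model_translate, "en")
# for src, tgt in pairs:
#     train_alignment_objective(src, tgt)
```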
