On Learning Universal Representations Across Languages

Recent studies have demonstrated the overwhelming advantage of cross-lingual pre-trained models (PTMs), such as multilingual BERT and XLM, on cross-lingual NLP tasks. However, existing approaches essentially capture co-occurrence among tokens by relying on the masked language model (MLM) objective with token-level cross-entropy. In this work, we extend these approaches to learn sentence-level representations and show their effectiveness on cross-lingual understanding and generation. We propose Hierarchical Contrastive Learning (HiCTL) to (1) learn universal representations for parallel sentences distributed across one or multiple languages and (2) distinguish the semantically related words from a shared cross-lingual vocabulary for each sentence. We conduct evaluations on three benchmarks: language understanding tasks (QQP, QNLI, SST-2, MRPC, STS-B and MNLI) from the GLUE benchmark, cross-lingual natural language inference (XNLI), and machine translation. Experimental results show that HiCTL obtains an absolute gain of 1.0%/2.2% accuracy on GLUE/XNLI and achieves substantial improvements of +1.7 to +3.6 BLEU on both high-resource and low-resource English-to-X translation tasks over strong baselines. We will release the source code as soon as possible.
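To make the sentence-level part of the objective concrete, the sketch below shows a generic InfoNCE-style contrastive loss over a batch of parallel sentence embeddings, where each aligned source-target pair is treated as a positive and all other in-batch combinations as negatives. This is a minimal illustration under our own assumptions (function name, pooling into fixed-size embeddings, symmetric cross-entropy, temperature value), not the paper's exact implementation; the word-level term of HiCTL would be built analogously over token representations and a shared cross-lingual vocabulary.

```python
import torch
import torch.nn.functional as F

def sentence_contrastive_loss(src_emb: torch.Tensor,
                              tgt_emb: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical InfoNCE-style loss over parallel sentence embeddings.

    src_emb, tgt_emb: (batch, dim) pooled encoder outputs for aligned
    source/target sentences. The i-th source and i-th target form a
    positive pair; every other in-batch pairing acts as a negative.
    """
    # Cosine similarity via L2-normalized dot products.
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: each source should retrieve its own
    # translation, and each target should retrieve its own source.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```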
