TransAug: Translate as Augmentation for Sentence Embeddings

While contrastive learning has greatly advanced sentence-embedding representations, it remains limited by the size of existing sentence datasets. In this paper, we present TransAug (Translate as Augmentation), which provides the first exploration of translated sentence pairs as data augmentation for text, and we introduce a two-stage paradigm that advances the state of the art in sentence embeddings. Rather than adopting an encoder trained in a different-language setting, we first distill a Chinese encoder from a SimCSE encoder (pretrained on English) so that their embeddings lie close together in semantic space, which can be regarded as implicit data augmentation. We then update only the English encoder via cross-lingual contrastive learning while keeping the distilled Chinese encoder frozen. Our approach achieves a new state of the art on the standard semantic textual similarity (STS) tasks, outperforming both SimCSE and Sentence-T5, and attains the best performance in the corresponding tracks of the transfer tasks evaluated by SentEval.
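The abstract describes a concrete two-stage recipe, so a short sketch may help make it precise. The following is a minimal PyTorch sketch under stated assumptions: en_encoder and zh_encoder are hypothetical callables mapping a batch of sentences to (batch, dim) embedding tensors, and the MSE distillation objective, the InfoNCE-style contrastive loss, and the temperature tau are illustrative choices, not details taken from the paper itself.

```python
import torch
import torch.nn.functional as F

# Stage 1: distill a Chinese student encoder from the frozen English SimCSE
# teacher. Each (English, Chinese) translated pair is pushed toward the same
# point in embedding space -- the "implicit data augmentation" described above.
def distillation_loss(en_encoder, zh_encoder, en_batch, zh_batch):
    with torch.no_grad():               # English teacher stays fixed in stage 1
        teacher = en_encoder(en_batch)  # (batch, dim) sentence embeddings
    student = zh_encoder(zh_batch)
    return F.mse_loss(student, teacher)

# Stage 2: cross-lingual contrastive learning. The distilled Chinese encoder
# is now frozen; only the English encoder is updated. Translated pairs serve
# as positives, other in-batch sentences as negatives (InfoNCE).
def cross_lingual_contrastive_loss(en_encoder, zh_encoder,
                                   en_batch, zh_batch, tau=0.05):
    en_emb = F.normalize(en_encoder(en_batch), dim=-1)      # trainable side
    with torch.no_grad():                                   # frozen side
        zh_emb = F.normalize(zh_encoder(zh_batch), dim=-1)
    logits = en_emb @ zh_emb.t() / tau  # (batch, batch) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)  # diagonal = translated pair
```

Freezing the Chinese side in stage 2 means the aligned Chinese embeddings act as fixed translated "views" of the English sentences, so translation can serve as augmentation without the two encoders drifting apart.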

[1] Chris Brockett et al. Automatically Constructing a Corpus of Sentential Paraphrases, 2005, IJCNLP.

[2] Phillip Isola et al. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, 2020, ICML.

[3] Bo Pang et al. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales, 2005, ACL.

[4] Bo Pang et al. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004, ACL.

[5] Kaiming He et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2020, CVPR.

[6] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[7] Eneko Agirre et al. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation, 2016, *SEMEVAL.

[8] Quoc V. Le et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.

[9] Claire Cardie et al. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability, 2015, *SEMEVAL.

[10] Bing Liu et al. Mining and summarizing customer reviews, 2004, KDD.

[11] Jonathon Shlens et al. A Tutorial on Principal Component Analysis, 2014, arXiv.

[12] Tianyu Gao et al. SimCSE: Simple Contrastive Learning of Sentence Embeddings, 2021, EMNLP.

[13] Yiming Yang et al. On the Sentence Embeddings from BERT for Semantic Textual Similarity, 2020, EMNLP.

[14] Kaiming He et al. Improved Baselines with Momentum Contrastive Learning, 2020, arXiv.

[15] Ilya Sutskever et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[16] Keith B. Hall et al. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models, 2021, Findings of ACL.

[17] Marco Marelli et al. A SICK cure for the evaluation of compositional distributional semantic models, 2014, LREC.

[18] Ray Kurzweil et al. Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax, 2019, IJCAI.

[19] A. Kilgarriff. *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, 2012.

[20] Mingxuan Wang et al. Contrastive Learning for Many-to-many Multilingual Neural Machine Translation, 2021, ACL.

[21] Bo Zhao et al. AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding, 2017, arXiv.

[22] Samuel R. Bowman et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[23] Eneko Agirre et al. *SEM 2013 shared task: Semantic Textual Similarity, 2013, *SEMEVAL.

[24] Ellen M. Voorhees et al. Building a question answering test collection, 2000, SIGIR.

[25] Omer Levy et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[26] Weijia Wu et al. EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, 2021, arXiv.

[27] Douwe Kiela et al. SentEval: An Evaluation Toolkit for Universal Sentence Representations, 2018, LREC.

[28] Eneko Agirre et al. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation, 2017, *SEMEVAL.

[29] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[31] Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.

[32] Amaru Cuba Gyllensten et al. Semantic Re-tuning with Contrastive Tension, 2021, ICLR.

[33] Jeffrey Pennington et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[34] Christopher Potts et al. A large annotated corpus for learning natural language inference, 2015, EMNLP.

[35] Guillaume Lample et al. Cross-lingual Language Model Pretraining, 2019, NeurIPS.

[36] Ray Kurzweil et al. Multilingual Universal Sentence Encoder for Semantic Retrieval, 2019, ACL.

[37] Quoc V. Le et al. Distributed Representations of Sentences and Documents, 2014, ICML.

[38] Naveen Arivazhagan et al. Language-agnostic BERT Sentence Embedding, 2020, arXiv.

[39] Christopher Potts et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.

[40] Claire Cardie et al. Annotating Expressions of Opinions and Emotions in Language, 2005, Language Resources and Evaluation.

[41] Yann LeCun et al. Dimensionality Reduction by Learning an Invariant Mapping, 2006, CVPR.

[42] Claire Cardie et al. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity, 2014, *SEMEVAL.