Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders

Pretrained Masked Language Models (MLMs) have revolutionised NLP in recent years. However, previous work has indicated that off-the-shelf MLMs are not effective as universal lexical or sentence encoders without further task-specific fine-tuning on NLI, sentence similarity, or paraphrasing tasks using annotated task data. In this work, we demonstrate that it is possible to turn MLMs into effective universal lexical and sentence encoders even without any additional data and without any supervision. We propose an extremely simple, fast, and effective contrastive learning technique, termed Mirror-BERT, which converts MLMs (e.g., BERT and RoBERTa) into such encoders in less than a minute without any additional external knowledge. Mirror-BERT relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during “identity fine-tuning”. We report large gains over off-the-shelf MLMs with Mirror-BERT on both lexical-level and sentence-level tasks, across different domains and different languages. Notably, on the standard sentence semantic similarity (STS) tasks, our self-supervised Mirror-BERT model even matches the performance of the task-tuned Sentence-BERT models from prior work. Finally, we delve deeper into the inner workings of MLMs, and offer evidence for why this simple approach can yield effective universal lexical and sentence encoders.
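To make the “identity fine-tuning” idea concrete, the sketch below shows one contrastive update in PyTorch with Hugging Face Transformers. It is an illustrative simplification rather than the authors' released code: it uses mean pooling, relies on dropout alone to produce two differing views of each fully identical string (the abstract also mentions slightly modified strings as positives, which would correspond to input-space augmentation such as masking a random span), and the example strings, batch size, temperature, and learning rate are placeholder choices.

```python
# Minimal sketch of identity fine-tuning with an InfoNCE-style contrastive loss.
# Assumptions (not taken verbatim from the paper): mean pooling, dropout as the
# only augmentation, and illustrative hyperparameters (temperature, learning rate).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.train()  # keep dropout active so two passes over the same string differ
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # mean pooling -> (B, H)

# Any raw strings (words, phrases, or sentences) can serve as training data.
strings = ["aspirin", "acetylsalicylic acid", "the cat sat on the mat"]

# Two stochastic views of the identical strings act as positive pairs;
# all other in-batch strings act as negatives.
z1 = F.normalize(encode(strings), dim=-1)
z2 = F.normalize(encode(strings), dim=-1)

temperature = 0.05
logits = z1 @ z2.t() / temperature                   # (B, B) similarity matrix
labels = torch.arange(len(strings))                  # i-th view matches i-th copy
loss = F.cross_entropy(logits, labels)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice, this loop would run over a small set of raw strings for a handful of steps, which is consistent with the abstract's claim that the conversion completes in under a minute.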
