Obtaining Better Static Word Embeddings Using Contextual Embedding Models

The advent of contextual word embeddings, representations of words that incorporate semantic and syntactic information from their context, has led to tremendous improvements on a wide variety of NLP tasks. However, recent contextual models have a prohibitively high computational cost in many use cases and are often hard to interpret. In this work, we demonstrate that our proposed distillation method, a simple extension of CBOW-based training, significantly improves the computational efficiency of NLP applications while yielding static embeddings that outperform existing static embeddings trained from scratch as well as those distilled with previously proposed methods. As a side effect, our approach also enables a fair comparison of contextual and static embeddings via standard lexical evaluation tasks.
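
The abstract does not spell out the training procedure, so the following is only a minimal sketch of what CBOW-style distillation from a frozen contextual model could look like: a static embedding matrix is trained with negative sampling so that each word's static vector scores highly against a contextual representation of its sentence. The teacher model, mean-pooling choice, toy corpus, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch (not the paper's exact method): distilling static word
# vectors from a frozen contextual "teacher" with a CBOW-like objective.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
teacher = AutoModel.from_pretrained("bert-base-uncased").eval()  # frozen teacher

# Toy vocabulary and corpus; a real run would stream a large corpus.
sentences = ["the bank approved the loan", "the river bank was muddy"]
vocab = sorted({w for s in sentences for w in s.split()})
word2id = {w: i for i, w in enumerate(vocab)}

dim = teacher.config.hidden_size
static_emb = torch.nn.Embedding(len(vocab), dim)   # the distilled static vectors
optimizer = torch.optim.Adam(static_emb.parameters(), lr=1e-3)

def context_vector(sentence: str) -> torch.Tensor:
    """Contextual representation of the sentence from the frozen teacher."""
    with torch.no_grad():
        enc = tokenizer(sentence, return_tensors="pt")
        hidden = teacher(**enc).last_hidden_state   # (1, T, dim)
        return hidden.mean(dim=1).squeeze(0)        # simple mean-pooling

for epoch in range(5):
    for sent in sentences:
        ctx = context_vector(sent)
        for w in sent.split():
            target = static_emb(torch.tensor(word2id[w]))
            # CBOW-style negative sampling over random vocabulary entries.
            neg_ids = torch.randint(0, len(vocab), (5,))
            neg = static_emb(neg_ids)
            pos_score = F.logsigmoid(target @ ctx)
            neg_score = F.logsigmoid(-(neg @ ctx)).sum()
            loss = -(pos_score + neg_score)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# static_emb.weight now holds static vectors that can be evaluated on
# standard lexical benchmarks alongside the contextual teacher.
```

Mean-pooling the teacher's hidden states is simply the most basic plausible choice here; the actual context representation, negative-sampling scheme, and training corpus used in the paper may differ.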
