Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. It is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well and, as we show in this work, produce unstable and hence less reliable results. We propose an alternative approach that does not rely on vector space alignment and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, covering different corpus-splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).
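The abstract only sketches the alternative; below is a minimal sketch of one way a neighbor-based comparison could be realized, assuming word2vec embeddings (here via gensim) trained separately on the two corpora and a top-k nearest-neighbor overlap score. The function names, the choice of k, and the training parameters are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Minimal sketch: rank words by how little their nearest-neighbor sets
# overlap across two separately trained embedding spaces.
# Assumptions: gensim >= 4, word2vec embeddings, shared-vocabulary filtering, k=100.
from gensim.models import Word2Vec

def train_embeddings(sentences):
    """Train word2vec on one corpus; `sentences` is a list of token lists."""
    return Word2Vec(sentences, vector_size=100, min_count=20, workers=4).wv

def neighbor_overlap(wv_a, wv_b, word, shared, k=100):
    """Size of the intersection of the word's top-k neighbor lists in the two
    spaces, restricted to the shared vocabulary."""
    nn_a = {w for w, _ in wv_a.most_similar(word, topn=k) if w in shared}
    nn_b = {w for w, _ in wv_b.most_similar(word, topn=k) if w in shared}
    return len(nn_a & nn_b)

def rank_by_usage_change(wv_a, wv_b, k=100):
    """Words with the smallest neighbor overlap are the strongest candidates
    for usage change between the two corpora."""
    shared = set(wv_a.key_to_index) & set(wv_b.key_to_index)
    scores = {w: neighbor_overlap(wv_a, wv_b, w, shared, k) for w in shared}
    return sorted(shared, key=scores.get)
```

Unlike the alignment-based pipeline described above, this kind of score never maps one space onto the other (e.g. via an orthogonal Procrustes transform), so it sidesteps the instability of a learned alignment; only the local neighborhood of each word is compared.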
