Statistically Significant Detection of Linguistic Change

We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage and then applies statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity for generating such linguistic property time series, the most sophisticated of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Books Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
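
The pipeline described above (per-period word vectors, alignment into one coordinate system, a distance-based time series, then change point detection) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the embeddings are taken as already trained and given as NumPy arrays with one row per word, orthogonal Procrustes is swapped in for the warping step (the abstract does not say how the spaces are aligned), and a permutation test on the largest mean shift stands in for the change point algorithm. The function names `align`, `displacement_series`, `mean_shift`, and `detect_change_point` are hypothetical.

```python
import numpy as np


def align(source, target):
    """Rotate `source` onto `target` with orthogonal Procrustes.

    Assumption: both matrices have the same shape and row i is the same word.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)


def displacement_series(embeddings, word_index):
    """Cosine distance of one word from its first-period vector, per period."""
    base = embeddings[0]
    ref = base[word_index]
    series = []
    for emb in embeddings:
        vec = align(emb, base)[word_index]
        cos = vec @ ref / (np.linalg.norm(vec) * np.linalg.norm(ref))
        series.append(1.0 - cos)
    return np.array(series)


def mean_shift(series):
    """Mean after minus mean before, for every split point 1..n-1."""
    n = len(series)
    return np.array([series[j:].mean() - series[:j].mean() for j in range(1, n)])


def detect_change_point(series, n_perm=1000, seed=0):
    """Return (split index, p-value) for the largest upward mean shift.

    The p-value is the share of random permutations of the series whose
    largest mean shift is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = mean_shift(series)
    best = int(np.argmax(observed))
    stat = observed[best]
    hits = sum(
        mean_shift(rng.permutation(series)).max() >= stat for _ in range(n_perm)
    )
    p_value = (hits + 1) / (n_perm + 1)
    return best + 1, p_value
```

Here `detect_change_point(displacement_series(embeddings, i))` would return the first period of the shifted regime for word `i` together with a permutation p-value. Measuring displacement from the first period is one possible reference choice; other reference points would fit the same structure.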
