An automatic approach to identify word sense changes in text media across timescales

In this paper, we propose an unsupervised, automated method for identifying noun sense changes, based on rigorous analysis of time-varying text data available in the form of millions of digitized books and millions of tweets posted per day. We construct distributional-thesaurus-based networks from data at different time points and cluster each network separately to obtain word-centric sense clusters for each time point. We then propose a split/join-based approach that compares the sense clusters at two time points to determine whether a new sense has undergone ‘birth’. The same comparison also reveals whether an older sense has ‘split’ into more than one sense, whether a newer sense has formed from the ‘join’ of older senses, or whether a particular sense has undergone ‘death’. We apply this completely unsupervised approach (a) within the Google Books data, to identify word sense differences within a single medium, and (b) across Google Books and Twitter data, to identify differences in word sense distribution across media. We evaluate the proposed methodology thoroughly, both manually and through comparison with WordNet.
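The split/join comparison of sense clusters at two time points can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `compare_sense_clusters`, the overlap criterion, the threshold `min_share`, and the toy clusters are all assumptions made for illustration. Each sense cluster is modeled simply as a set of related words.

```python
from typing import Dict, List, Set


def compare_sense_clusters(
    old: List[Set[str]],
    new: List[Set[str]],
    min_share: float = 0.3,  # assumed threshold, not taken from the paper
) -> Dict[str, list]:
    """Label sense changes between an earlier and a later clustering.

    A cluster "draws on" another when at least `min_share` of its own
    words also appear in the other cluster (an illustrative criterion).
    """
    events: Dict[str, list] = {"birth": [], "join": [], "split": [], "death": []}

    for j, new_c in enumerate(new):
        if not new_c:
            continue  # skip empty clusters to avoid division by zero
        # Old clusters that supply a large enough share of this new cluster.
        sources = [
            i for i, old_c in enumerate(old)
            if len(new_c & old_c) / len(new_c) >= min_share
        ]
        if not sources:
            events["birth"].append(j)            # no older sense explains it
        elif len(sources) >= 2:
            events["join"].append((sources, j))  # several old senses merged

    for i, old_c in enumerate(old):
        if not old_c:
            continue
        # New clusters that each inherit a large enough share of this old cluster.
        targets = [
            j for j, new_c in enumerate(new)
            if len(old_c & new_c) / len(old_c) >= min_share
        ]
        if not targets:
            events["death"].append(i)             # the sense has disappeared
        elif len(targets) >= 2:
            events["split"].append((i, targets))  # one sense became several

    return events


# Toy clusters for one target word at two time points (made-up data).
old_clusters = [{"ill", "unwell", "fever", "nausea"}]
new_clusters = [
    {"ill", "unwell", "fever", "diseased"},
    {"cool", "awesome", "crazy", "wicked"},
]
events = compare_sense_clusters(old_clusters, new_clusters)
print(events)
```

In this toy case the second later cluster shares no words with any earlier cluster, so it is reported as a birth, while the first later cluster is treated as a continuation of the existing sense. Lowering or raising `min_share` trades off sensitivity to new senses against noise from clustering instability.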
