That’s sick dude!: Automatic identification of word sense change across different timescales

In this paper, we propose an unsupervised method to identify noun sense changes by analyzing time-varying text data available in the form of millions of digitized books. We construct distributional-thesaurus-based networks from data at different time points and cluster each of them separately to obtain word-centric sense clusters for each time point. We then compare the sense clusters of two different time points to determine whether (i) a new sense has been born, (ii) an older sense has split into more than one sense, (iii) a newer sense has formed by the joining of older senses, or (iv) a particular sense has died. We evaluate the proposed methodology both manually and through comparison with WordNet. Manual evaluation indicates that the algorithm correctly identified 60.4% of birth cases from a set of 48 randomly picked samples and 57% of split/join cases from a set of 21 randomly picked samples. Remarkably, in 44% of cases the birth of a novel sense is attested by WordNet, while split and join are confirmed by WordNet in 46% and 43% of cases, respectively. Our approach can be applied to lexicography, as well as to applications such as word sense disambiguation or semantic search.
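The comparison of sense clusters across two time points can be illustrated with a minimal sketch. This is not the procedure used in the paper, whose details are not given in the abstract. It assumes each sense cluster is a plain set of words, uses a simple word-overlap fraction, and applies a hypothetical `threshold` of 0.3. The function names and the toy clusters for "sick" are invented for illustration only.

```python
from typing import Dict, List, Set


def overlap(a: Set[str], b: Set[str]) -> float:
    """Fraction of the words in cluster `a` that also occur in cluster `b`."""
    return len(a & b) / len(a) if a else 0.0


def classify_changes(
    old: List[Set[str]],
    new: List[Set[str]],
    threshold: float = 0.3,  # assumed cutoff, not a value from the paper
) -> Dict[str, list]:
    """Label sense changes between two lists of sense clusters of one word."""
    events: Dict[str, list] = {"birth": [], "death": [], "split": [], "join": []}

    # Look backwards from each new cluster: which old clusters feed into it?
    for j, new_cluster in enumerate(new):
        sources = [i for i, oc in enumerate(old)
                   if overlap(new_cluster, oc) >= threshold]
        if not sources:
            events["birth"].append(j)            # no old cluster resembles it
        elif len(sources) >= 2:
            events["join"].append((sources, j))  # several old senses merged

    # Look forwards from each old cluster: where did its words go?
    for i, old_cluster in enumerate(old):
        targets = [j for j, nc in enumerate(new)
                   if overlap(old_cluster, nc) >= threshold]
        if not targets:
            events["death"].append(i)            # no new cluster resembles it
        elif len(targets) >= 2:
            events["split"].append((i, targets))  # one old sense divided

    return events


# Toy clusters for the word "sick" (invented data, not from the paper).
old_senses = [{"ill", "unwell", "diseased", "ailing"}]
new_senses = [
    {"ill", "unwell", "diseased", "feverish"},
    {"cool", "awesome", "rad", "wicked"},
]
sick_events = classify_changes(old_senses, new_senses)

# A second toy case in which one old cluster divides into two new ones.
split_events = classify_changes([{"a", "b", "c", "d"}], [{"a", "b"}, {"c", "d"}])
```

In the first toy case the second new cluster shares no words with the old one, so it is reported as a birth. In the second case the single old cluster contributes half of its words to each new cluster and is reported as a split. The asymmetric overlap (new-in-old for birth and join, old-in-new for death and split) is one possible design choice among several.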
