Unsupervised Word Usage Similarity in Social Media Texts

We propose an unsupervised method, based on topic modelling, for automatically calculating word usage similarity in social media data, and contrast it with a baseline distributional method and Weighted Textual Matrix Factorization. We evaluate these methods against a novel dataset of human usage similarity ratings for 550 Twitter message pairs covering a set of 10 nouns. The results show that our topic modelling approach outperforms the other two methods.
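As a rough illustration of the general idea, the sketch below scores the similarity of two usages of a word by comparing the topic proportions inferred for the messages that contain it. The abstract does not say how topic distributions are compared, so the use of cosine similarity, the function name `cosine_similarity`, and the three-topic vectors are all assumptions made for this example, not the authors' implementation.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two equal-length topic-proportion vectors."""
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_p * norm_q)


# Hypothetical topic proportions for three tweets containing the same noun.
# In practice these would come from a trained topic model; the values here
# are invented for illustration only.
tweet_a = [0.70, 0.20, 0.10]
tweet_b = [0.65, 0.25, 0.10]
tweet_c = [0.05, 0.15, 0.80]

sim_ab = cosine_similarity(tweet_a, tweet_b)
sim_ac = cosine_similarity(tweet_a, tweet_c)

print(f"similarity(a, b) = {sim_ab:.3f}")
print(f"similarity(a, c) = {sim_ac:.3f}")
```

Under these assumed vectors, the pair whose messages are dominated by the same topic receives the higher score. Scores of this kind could then be compared with the human usage similarity ratings.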
