Clustering tweets usingWikipedia concepts

Two challenging issues are notable in tweet clustering. Firstly, the sparse data problem is serious since no tweet can be longer than 140 characters. Secondly, synonymy and polysemy are rather common because users intend to present a unique meaning with a great number of manners in tweets. Enlightened by the recent research which indicates Wikipedia is promising in representing text, we exploit Wikipedia concepts in representing tweets with concept vectors. We address the polysemy issue with a Bayesian model, and the synonymy issue by exploiting the Wikipedia redirections. To further alleviate the sparse data problem, we further make use of three types of out-links in Wikipedia. Evaluation on a twitter dataset shows that the concept model outperforms the traditional VSM model in tweet clustering.

[1]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[2]  Alan Ritter,et al.  Unsupervised Modeling of Twitter Conversations , 2010, NAACL.

[3]  Evgeniy Gabrilovich,et al.  Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge , 2006, AAAI.

[4]  Anand Karandikar,et al.  Clustering short status messages: A topic model based approach , 2010 .

[5]  Ellen M. Voorhees,et al.  Implementing agglomerative hierarchic clustering algorithms for use in document retrieval , 1986, Inf. Process. Manag..

[6]  Chorkin Chan,et al.  Chinese Word Segmentation based on Maximum Matching and Word Binding Force , 1996, COLING.

[7]  Mirella Lapata,et al.  Bayesian Word Sense Induction , 2009, EACL.

[8]  Andreas Stafylopatis,et al.  Exploiting Wikipedia Knowledge for Conceptual Hierarchical Clustering of Documents , 2012, Comput. J..

[9]  Steffen Staab,et al.  WordNet improves text document clustering , 2003, SIGIR 2003.

[10]  Hua Li,et al.  Enhancing text clustering by leveraging Wikipedia semantics , 2008, SIGIR '08.

[11]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[12]  Paolo Ferragina,et al.  TAGME: on-the-fly annotation of short text fragments (by wikipedia entities) , 2010, CIKM.

[13]  Somnath Banerjee,et al.  Clustering short texts using wikipedia , 2007, SIGIR.

[14]  Susan T. Dumais,et al.  Characterizing Microblogs with Topic Models , 2010, ICWSM.

[15]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[16]  Susumu Horiguchi,et al.  Learning to classify short and sparse text & web with hidden topics from large-scale data collections , 2008, WWW.

[17]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[18]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[19]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[20]  Helmut Schmidt,et al.  Probabilistic part-of-speech tagging using decision trees , 1994 .

[21]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .