Efficient Clustering of Short Messages into General Domains

The ever-increasing activity in social networks is mainly manifested by a growing stream of status updates, or microblogging. This massive stream of updates emphasizes the need for accurate and efficient clustering of short messages on a large scale. Traditional clustering techniques are both inaccurate and inefficient on such data because short messages yield very sparse feature vectors. This paper presents an accurate and efficient algorithm for clustering Twitter tweets. We break the clustering task into two distinct stages: (1) batch clustering of user-annotated data, and (2) online clustering of a stream of tweets. In the first stage we rely on the habit of 'tagging', common in social media streams (e.g. hashtags), so that the algorithm can bootstrap on the tags to cluster a large pool of hashtagged tweets. The stable clusters obtained in the first stage lend themselves to online clustering of a stream of (mostly) tagless messages. We evaluate our results against a gold-standard classification and validate them with multiple clustering evaluation measures (information-theoretic, paired, F and greedy). We compare our algorithm to a number of other clustering algorithms and to various types of feature sets. Results show that the presented algorithm is both accurate and efficient, and that it can easily be used for large-scale clustering of sparse messages, since the heavy lifting is performed on a sublinear number of documents.
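The two stages described above can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation. It assumes that tweets sharing a hashtag are pooled into a single "virtual document", which would explain why the batch stage runs on far fewer documents than there are tweets. It also assumes bag-of-words features, cosine similarity, and a k-means-style refinement. The function names `bag`, `cosine`, `batch_stage` and `online_stage`, and the toy tweets, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

TAG = re.compile(r"#(\w+)")
WORD = re.compile(r"[a-z']+")


def bag(text):
    """Bag of words for one message, with hashtags stripped out."""
    return Counter(WORD.findall(TAG.sub(" ", text.lower())))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def batch_stage(tagged_tweets, k, iters=10):
    """Stage 1 (assumed form): pool tweets per hashtag, then cluster the pools."""
    docs = defaultdict(Counter)
    for tweet in tagged_tweets:
        for tag in TAG.findall(tweet.lower()):
            docs[tag].update(bag(tweet))
    tags = sorted(docs)
    # Deterministic seeding for the sketch: the first k tags in sorted order.
    centroids = [Counter(docs[t]) for t in tags[:k]]
    for _ in range(iters):
        groups = [Counter() for _ in centroids]
        for t in tags:
            best = max(range(len(centroids)),
                       key=lambda i: cosine(docs[t], centroids[i]))
            groups[best].update(docs[t])
        # Summed counts serve as centroids; cosine ignores vector length.
        centroids = [g if g else c for g, c in zip(groups, centroids)]
    return centroids


def online_stage(tweet, centroids):
    """Stage 2 (assumed form): assign a tagless tweet to the nearest centroid."""
    vec = bag(tweet)
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


if __name__ == "__main__":
    tagged = [
        "great goal in the match tonight #football",
        "the match ended with a late goal #football",
        "new phone launch with a faster chip #tech",
        "this phone has a great chip and camera #tech",
    ]
    centroids = batch_stage(tagged, k=2)
    print(online_stage("what a goal in that match", centroids))
```

In this sketch the iterative work of stage 1 touches one pooled document per hashtag rather than one per tweet, and stage 2 costs a single pass over the k centroids per incoming message. A realistic system would need better seeding, feature weighting and handling of tweets with several hashtags.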
