Unsupervised Construction of Topic-Based Twitter Lists

The Twitter lists feature was launched in late 2009 and enables the creation of curated groups of Twitter users. Any user can author a list and decide the basis on which other users are added to it. The most popular lists are those associated with a topic. Twitter lists can be a powerful organisational tool, but their widespread adoption has been limited. The two main obstacles are the initial setup time and the effort of continual curation. In this paper we attempt to address the first obstacle by applying unsupervised clustering algorithms to construct topic-based Twitter lists. We consider k-means and affinity propagation (AP) as clustering algorithms and evaluate them using two document representation techniques: the popular term frequency-inverse document frequency (TF-IDF) weighting and the latent Dirichlet allocation (LDA) topic model. We calculate the similarities for the clustering algorithms using five well-known similarity measures that have been used extensively in the text domain. We use the adjusted normalised information distance (ANID) to compare the clustering results yielded by k-means and affinity propagation. We found that careful selection of a similarity measure, combined with the LDA topic model, can provide a user with a sensible starting point for list creation.
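As a rough illustration of the representation and similarity steps described above, the following minimal sketch builds TF-IDF vectors for a few toy user documents and computes their pairwise similarities, which is the kind of input a clustering algorithm such as affinity propagation consumes. Everything in it is an assumption made for illustration: the abstract does not name the five similarity measures, so cosine similarity is used here as one common choice; the TF-IDF variant (relative term frequency times log(N/df)), the function names, and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumed weighting: (term count / document length) * log(N / df).
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Pairwise cosine similarities between the TF-IDF vectors of docs."""
    vectors = tfidf_vectors(docs)
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Hypothetical toy data: one aggregated "document" of tweets per user.
user_docs = {
    "alice": "football match goal league football",
    "bob": "goal league striker match",
    "carol": "python code compiler bug",
    "dave": "code bug python release",
}

names = list(user_docs)
sim = similarity_matrix([user_docs[name] for name in names])
for name, row in zip(names, sim):
    print(name, [round(value, 2) for value in row])
```

In this toy data the two sports accounts are more similar to each other than to the programming accounts, so a clustering algorithm run on the matrix would be expected to place them in the same candidate list.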
