Explainable User Clustering in Short Text Streams

User clustering has been studied from different angles: behavior-based clustering identifies similar browsing or search patterns, while content-based clustering identifies shared interests. Once found, user clusters can be used for recommendation and personalization. So far, content-based user clustering has mostly focused on static sets of relatively long documents. Given the dynamic nature of social media, there is a need to cluster users dynamically in the context of short text streams. User clustering in this setting is more challenging than with long documents, because it is difficult to capture users' dynamic topic distributions when data is sparse. To address this problem, we propose a dynamic user clustering topic model (UCT). UCT adaptively tracks changes in each user's time-varying topic distribution based on both the short texts the user posts during a given time period and the previously estimated distribution. To infer changes, we propose a Gibbs sampling algorithm in which a set of word pairs is constructed from each user for sampling. Unlike the output of many other clustering algorithms, the clustering results are explainable and human-understandable. For evaluation, we work with a dataset consisting of users and the tweets posted by each user. Experimental results demonstrate the effectiveness of our proposed clustering model compared to state-of-the-art baselines.
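
As a rough illustration of two ideas mentioned in the abstract, the sketch below builds word pairs from a user's short texts and blends topic counts from the current period with a previously estimated distribution. It is not the UCT model or its Gibbs sampler. The function names, the `prior_weight` parameter, and the smoothing formula are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def word_pairs(texts):
    """Count unordered word pairs that co-occur within the same short text."""
    pairs = Counter()
    for text in texts:
        # Deduplicate and sort so each pair has one canonical ordering.
        tokens = sorted(set(text.lower().split()))
        for a, b in combinations(tokens, 2):
            pairs[(a, b)] += 1
    return pairs


def update_topic_distribution(topic_counts, previous, prior_weight=1.0):
    """Blend current-period topic counts with the previous estimate.

    Assumption: the previous distribution acts like a Dirichlet-style prior
    scaled by prior_weight (a hypothetical parameter, not from the paper).
    """
    total = sum(topic_counts) + prior_weight
    return [(n + prior_weight * p) / total
            for n, p in zip(topic_counts, previous)]


# Two short posts from one hypothetical user in the current time period.
pairs = word_pairs(["new phone battery", "phone battery life"])

# Hypothetical per-topic assignment counts and last period's estimate.
theta = update_topic_distribution([3, 1], previous=[0.5, 0.5], prior_weight=2.0)
```

A larger `prior_weight` keeps the new estimate closer to the previous distribution, which can help when a user posts only a few short texts in a period.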
