Inferring Dynamic User Interests in Streams of Short Texts for User Clustering

User clustering has been studied from different angles. To identify shared interests, behavior-based methods consider similar browsing or search patterns of users, whereas content-based methods use the contents of the documents that users visit. So far, content-based user clustering has mostly focused on static sets of relatively long documents. Given the dynamic nature of social media, there is a need to cluster users dynamically in the context of streams of short texts. User clustering in this setting is more challenging than in the case of long documents, because the sparsity of short texts makes it difficult to capture users’ dynamic topic distributions. To address this problem, we propose a dynamic user clustering topic model (UCT). UCT adaptively tracks changes in each user’s time-varying topic distribution based both on the short texts the user posts during a given time period and on previously estimated distributions. To infer these changes, we propose a Gibbs sampling algorithm in which a set of word pairs is constructed from each user’s texts for sampling. UCT can be used in two ways: (1) as a short-term dependency model that infers a user’s current topic distribution from the user’s topic distribution during the previous time period only, and (2) as a long-term dependency model that infers a user’s current topic distribution from the user’s topic distributions during multiple past time periods. In contrast to the results of many other clustering algorithms, the clustering results are explainable and human-understandable. For evaluation, we work with a dataset of Twitter users and their tweets. Experimental results demonstrate the effectiveness of the proposed short-term and long-term dependency user clustering models compared to state-of-the-art baselines.
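As a rough illustration of the word-pair construction mentioned above, the following Python sketch collects word pairs from one user's short texts in a single time period. The abstract does not specify how pairs are formed, so this assumes unordered pairs of distinct words that co-occur within the same post, with whitespace tokenization and lowercasing; the function name `build_word_pairs` and the sample posts are hypothetical, and the sketch does not reproduce UCT's sampler.

```python
from collections import Counter
from itertools import combinations


def build_word_pairs(posts):
    """Count unordered word pairs that co-occur within a single short text.

    Assumption (not from the paper): a pair is any two distinct words
    appearing in the same post; repeated words within a post count once.
    """
    pairs = Counter()
    for post in posts:
        # Deduplicate and sort so each pair has one canonical ordering.
        tokens = sorted(set(post.lower().split()))
        for w1, w2 in combinations(tokens, 2):
            pairs[(w1, w2)] += 1
    return pairs


# Hypothetical posts by one user during one time period.
user_posts = ["new topic model paper", "topic model for tweets"]
pairs = build_word_pairs(user_posts)
```

Pooling pairs per user rather than per post is one way to work around the sparsity of individual short texts, since each user then contributes many more co-occurrence observations than any single post does.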
