Model-based Clustering of Short Text Streams

Short text stream clustering has become an increasingly important problem due to the explosive growth of short text on diverse social media. In this paper, we propose a model-based short text stream clustering algorithm (MStream) that deals naturally with both the concept drift problem and the sparsity problem. MStream achieves state-of-the-art performance with only one pass over the stream, and performs even better when multiple iterations over each batch are allowed. We further propose an improved variant of MStream with forgetting rules, called MStreamF, which efficiently deletes outdated documents by deleting the clusters of outdated batches. Our extensive experimental study shows that MStream and MStreamF achieve better performance than three baselines on several real datasets.
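To make the one-pass and forgetting ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Dirichlet-multinomial cluster likelihood, a constant prior weight `alpha` for opening a new cluster, and a greedy (argmax) assignment in place of sampling so that the result is deterministic. The class name `StreamClusterer` and the parameters `vocab_size`, `alpha`, `beta`, and `max_batches` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter, deque


class StreamClusterer:
    """Illustrative one-pass clusterer with batch-level forgetting (assumed design)."""

    def __init__(self, vocab_size, alpha=1.0, beta=0.1, max_batches=2):
        self.V, self.alpha, self.beta = vocab_size, alpha, beta
        self.max_batches = max_batches
        self.clusters = {}      # cluster id -> {"docs": int, "words": Counter}
        self.history = deque()  # per-batch contributions, oldest first
        self.next_id = 0

    def _log_score(self, doc, cluster):
        # Log prior weight plus Dirichlet-multinomial predictive likelihood of
        # `doc` under `cluster`; `cluster=None` stands for a brand-new cluster.
        words = cluster["words"] if cluster else Counter()
        total = sum(words.values())
        score = math.log(cluster["docs"] if cluster else self.alpha)
        for w, c in Counter(doc).items():
            for j in range(c):
                score += math.log(words[w] + self.beta + j)
        for i in range(len(doc)):
            score -= math.log(total + self.V * self.beta + i)
        return score

    def process_batch(self, batch):
        contrib = {}  # what this batch adds to each cluster
        labels = []
        for doc in batch:  # single pass: each document is seen once
            best, best_score = None, self._log_score(doc, None)
            for cid, cl in self.clusters.items():
                s = self._log_score(doc, cl)
                if s > best_score:
                    best, best_score = cid, s
            if best is None:  # opening a new cluster scored highest
                best = self.next_id
                self.next_id += 1
                self.clusters[best] = {"docs": 0, "words": Counter()}
            batch_part = contrib.setdefault(best, {"docs": 0, "words": Counter()})
            for store in (self.clusters[best], batch_part):
                store["docs"] += 1
                store["words"].update(doc)
            labels.append(best)
        self.history.append(contrib)
        while len(self.history) > self.max_batches:
            self._forget(self.history.popleft())
        return labels

    def _forget(self, contrib):
        # Subtract an outdated batch's contribution; drop clusters left empty.
        for cid, old in contrib.items():
            cl = self.clusters[cid]
            cl["docs"] -= old["docs"]
            cl["words"].subtract(old["words"])
            if cl["docs"] == 0:
                del self.clusters[cid]


model = StreamClusterer(vocab_size=20, max_batches=1)
first = model.process_batch([["apple", "banana"], ["apple", "banana", "apple"]])
second = model.process_batch([["stock", "market"], ["market", "stock", "stock"]])
```

In this sketch, forgetting never revisits individual documents: each batch stores only its aggregate contribution per cluster, so discarding an outdated batch is a subtraction of counts followed by removal of any cluster that becomes empty.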
