Discovering cluster evolution patterns with the Cluster Association-aware matrix factorization

Tracking document collections over time (or across domains) is useful in several applications, such as finding the dynamics of terminologies, identifying emerging and evolving trends, and detecting concept drift. We propose a novel ‘Cluster Association-aware’ Non-negative Matrix Factorization (NMF)-based method with graph-based visualization to identify the changing dynamics of text clusters over time or across domains. NMF is used to find similar clusters across the set of clustering solutions. Based on these similarities, four major lifecycle states of clusters, namely birth, split, merge and death, are tracked to discover cluster emergence, growth, persistence and decay. The novel concepts of ‘cluster associations’ and term frequency-based ‘cluster density’ are used to improve the quality of the evolution patterns. The cluster evolution is visualized using a k-partite graph. Empirical analysis on text data shows that the proposed method produces more accurate and efficient solutions than state-of-the-art methods.
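The lifecycle states named above can be illustrated with a minimal sketch. It is not the proposed method: plain cosine similarity between term-frequency vectors stands in for the NMF-based cluster similarity, and the cluster association and cluster density concepts are omitted. The function names, the `threshold` value and the toy clusters are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def track_evolution(prev, curr, threshold=0.3):
    """Label transitions between two consecutive clustering solutions.

    prev, curr: dict mapping cluster id -> {term: frequency}.
    threshold: assumed similarity cut-off (not taken from the paper).
    Returns the similarity edges and a list of lifecycle events.
    """
    # Link every pair of clusters whose similarity reaches the cut-off.
    edges = [(p, c) for p in prev for c in curr
             if cosine(prev[p], curr[c]) >= threshold]
    out_links = {p: [c for q, c in edges if q == p] for p in prev}
    in_links = {c: [p for p, d in edges if d == c] for c in curr}

    events = []
    for p, targets in out_links.items():
        if not targets:
            events.append(("death", p))          # no successor
        elif len(targets) > 1:
            events.append(("split", p, tuple(targets)))
    for c, sources in in_links.items():
        if not sources:
            events.append(("birth", c))          # no predecessor
        elif len(sources) > 1:
            events.append(("merge", tuple(sources), c))
    return edges, events


# Toy clustering solutions at two consecutive time steps (made-up data).
t1 = {
    "A": {"nmf": 3, "matrix": 2, "topic": 2, "stream": 2},
    "B": {"virus": 4, "outbreak": 2},
}
t2 = {
    "X": {"nmf": 3, "matrix": 3},
    "Y": {"topic": 3, "stream": 2},
    "Z": {"election": 5, "vote": 1},
}

edges, events = track_evolution(t1, t2)
print(edges)
print(events)
```

On this toy input, cluster A splits into X and Y, B dies and Z is born. The returned edges connect clusters of adjacent time steps only, so repeating the comparison over every consecutive pair of time steps gives the layered edge set of a k-partite graph.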
