Document Clustering and Topic Modeling: A Unified Bayesian Probabilistic Perspective

Document clustering and topic modeling are fundamental tasks in text mining that can be unified so that each enhances the other. In this paper, we present a machine learning approach to the joint modeling and interdependent fulfilment of both tasks. In particular, document clustering and topic modeling are interrelated under a Bayesian generative model of clusters, topics and contents in text corpora. The model assumes that text corpora result from a generative process in which clusters and topics act as connected latent factors. Essentially, each cluster is first associated with a descriptive and actionable topic distribution, which enforces cluster coherence. Each individual document is then assigned to exactly one cluster and worded according to the topic distribution of that cluster. Under the devised model, document clustering and topic modeling can be performed simultaneously and interdependently by Bayesian reasoning. For this purpose, the mathematical details of collapsed Gibbs sampling and parameter estimation are derived and implemented in an approximate inference algorithm. Comparative experiments on standard benchmark text corpora reveal the effectiveness of our approach at jointly clustering text documents and unveiling their semantics in terms of coherent topics.
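The generative process outlined in the abstract can be illustrated with a minimal sketch. The abstract does not specify priors or notation, so everything below is an assumption made for illustration: the function name `generate_corpus`, the symmetric Dirichlet priors, the hyperparameters `alpha`, `beta` and `gamma`, and the fixed document length are all hypothetical and are not taken from the paper. The sketch only mirrors the stated order of events: clusters receive topic distributions, each document is assigned to one cluster, and its words are drawn through the topics of that cluster.

```python
import numpy as np


def generate_corpus(n_docs=20, n_clusters=3, n_topics=5, vocab_size=50,
                    doc_len=30, alpha=0.5, beta=0.1, gamma=1.0, seed=0):
    """Sample a toy corpus from an assumed cluster-topic generative process."""
    rng = np.random.default_rng(seed)

    # Mixing weights over clusters (assumed symmetric Dirichlet prior).
    pi = rng.dirichlet(np.full(n_clusters, gamma))
    # One topic distribution per cluster (assumed symmetric Dirichlet prior).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_clusters)
    # One word distribution per topic (assumed symmetric Dirichlet prior).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # Each document is assigned to exactly one cluster.
    clusters = rng.choice(n_clusters, size=n_docs, p=pi)

    docs = []
    for c in clusters:
        # Each word position draws a topic from the topic distribution
        # of the document's cluster ...
        topics = rng.choice(n_topics, size=doc_len, p=theta[c])
        # ... and then a word identifier from that topic's word distribution.
        words = np.array([rng.choice(vocab_size, p=phi[z]) for z in topics])
        docs.append(words)

    return clusters, docs, theta, phi
```

Calling `generate_corpus()` returns the sampled cluster assignments, the documents as arrays of word identifiers, and the cluster-topic and topic-word distributions. Inference, such as the collapsed Gibbs sampling mentioned in the abstract, would run in the opposite direction and recover the assignments and distributions from the observed words alone.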
