Jointly modeling and simultaneously discovering topics and clusters in text corpora using word vectors

Abstract An innovative model-based approach to coupling text clustering and topic modeling is introduced, in which the two tasks benefit from each other. Specifically, the integration is enabled by a new generative model of text corpora, which explains topics, clusters and document content through a Bayesian generative process. In this process, documents are represented as collections of word vectors, which capture the syntactic and semantic regularities among words. Topics are multivariate Gaussian distributions over word vectors. Clusters are assigned corresponding topic distributions as their semantics. Content generation is governed by text clusters and topics, which act as interacting latent factors: each document is first placed into a cluster, the semantics of that cluster is then repeatedly sampled to draw document topics, and these topics are in turn sampled to generate word vectors. Under the proposed model, collapsed Gibbs sampling is mathematically derived and algorithmically implemented, along with parameter estimation, for the simultaneous inference of text clusters and topics. A comparative assessment on real-world benchmark corpora demonstrates the effectiveness of this approach in clustering texts and uncovering their semantics. Intrinsic and extrinsic criteria are adopted to investigate its topic modeling performance, and the results are illustrated through a case study. Time efficiency and scalability are also studied.
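
The following is a minimal sketch, not the authors' implementation, of the generative process described above. It assumes hypothetical hyperparameters (num_clusters, num_topics, embedding dimension, Dirichlet prior alpha, cluster proportions pi) and fixed topic Gaussians, purely to illustrate how clusters, topic distributions, and word vectors interact.

```python
# Hedged sketch of the generative process: clusters carry topic distributions,
# topics are Gaussians over word vectors, documents are bags of word vectors.
# All sizes and priors below are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(0)

num_clusters, num_topics, dim = 3, 10, 50            # assumed sizes
alpha = np.full(num_topics, 0.1)                      # assumed Dirichlet prior on cluster semantics
pi = np.full(num_clusters, 1.0 / num_clusters)        # assumed cluster proportions

# Topics: multivariate Gaussian distributions over word vectors.
topic_means = rng.normal(size=(num_topics, dim))
topic_covs = np.stack([np.eye(dim) for _ in range(num_topics)])

# Clusters: each cluster is assigned a topic distribution as its semantics.
cluster_topic_dists = rng.dirichlet(alpha, size=num_clusters)

def generate_document(num_words):
    """Generate one document as a bag of word vectors."""
    cluster = rng.choice(num_clusters, p=pi)          # place the document into a cluster
    theta = cluster_topic_dists[cluster]              # the cluster's semantics
    word_vectors = []
    for _ in range(num_words):
        topic = rng.choice(num_topics, p=theta)       # draw a document topic from the cluster semantics
        vec = rng.multivariate_normal(topic_means[topic], topic_covs[topic])
        word_vectors.append(vec)                      # draw a word vector from the topic's Gaussian
    return cluster, np.array(word_vectors)

cluster, vectors = generate_document(num_words=20)
```

In the actual model the topic Gaussians and cluster semantics are latent and inferred jointly via collapsed Gibbs sampling with parameter estimation, rather than fixed in advance as in this sketch.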
