Group topic model: organizing topics into groups

Abstract: Latent Dirichlet allocation (LDA) defines hidden topics to capture latent semantics in text documents. However, it assumes that all documents are represented by the same topics, which leads to the “forced topic” problem. To address this problem, we developed group latent Dirichlet allocation (GLDA). GLDA uses two kinds of topics: local topics and global topics. Highly related local topics are organized into groups to describe local semantics, whereas global topics are shared by all documents to describe background semantics. GLDA uses variational inference algorithms for both offline and online data. We evaluated the proposed model on topic modeling and document clustering. Our experimental results indicate that GLDA achieves competitive performance compared with state-of-the-art approaches.
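
The split between local and global topics can be illustrated with a toy sampler. This is only a sketch under stated assumptions, not the authors' generative process, which the abstract does not spell out. It assumes that each document belongs to exactly one group, that its words are drawn from that group's local topics plus the shared global topics, and that topic proportions follow a symmetric Dirichlet prior. The function name `sample_document`, its parameters, and all sizes are hypothetical.

```python
import numpy as np

def sample_document(rng, group_weights, local_topics, global_topics,
                    doc_length=50, alpha=0.5):
    """Sample one document from a toy grouped-topic process (illustrative only).

    local_topics:  shape (G, L, V), word distributions of L local topics per group.
    global_topics: shape (K, V), word distributions shared by every document.
    """
    num_groups, num_local, vocab_size = local_topics.shape
    num_global = global_topics.shape[0]

    # Assumption: a document is assigned to a single group.
    group = rng.choice(num_groups, p=group_weights)

    # The document may only use its group's local topics and the global topics,
    # so it is not forced to spread mass over topics of unrelated groups.
    topics = np.vstack([local_topics[group], global_topics])  # (L + K, V)

    # Assumption: symmetric Dirichlet prior over the L + K available topics.
    theta = rng.dirichlet(np.full(num_local + num_global, alpha))

    # Draw a topic index per word position, then a word from that topic.
    z = rng.choice(num_local + num_global, size=doc_length, p=theta)
    words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
    return group, z, words

# Hypothetical sizes: 3 groups, 2 local topics each, 1 global topic, 12 words.
rng = np.random.default_rng(0)
G, L, K, V = 3, 2, 1, 12
local_topics = rng.dirichlet(np.full(V, 0.5), size=(G, L))
global_topics = rng.dirichlet(np.full(V, 1.0), size=K)
group_weights = np.full(G, 1.0 / G)

group, z, words = sample_document(rng, group_weights, local_topics, global_topics)
```

In this sketch, topic indices below `L` refer to the local topics of the document's group and the remaining indices refer to global topics. A plain LDA sampler would instead draw `theta` over all `G * L + K` topics for every document.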
