Correlated Topic Models

Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is its inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than about x-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [1]. We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets.
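The contrast between the two priors on topic proportions can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes NumPy, three hypothetical topics (genetics, disease, x-ray astronomy), and an illustrative covariance matrix chosen by hand. The helper name `sample_logistic_normal` is an assumption introduced here. It draws a Gaussian vector and maps it to the simplex with a softmax, which is a simplified form of the logistic normal. Components of a Dirichlet draw are always negatively correlated, whereas the covariance matrix of the logistic normal lets chosen pairs of topics co-occur more often.

```python
import numpy as np

def sample_logistic_normal(mu, sigma, n_samples, rng):
    """Draw eta ~ N(mu, sigma) and map each draw to the simplex via softmax."""
    eta = rng.multivariate_normal(mu, sigma, size=n_samples)
    eta = eta - eta.max(axis=1, keepdims=True)  # shift for numerical stability
    exp_eta = np.exp(eta)
    return exp_eta / exp_eta.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K = 3  # hypothetical topics: 0 = genetics, 1 = disease, 2 = x-ray astronomy

# Illustrative (assumed) parameters: genetics and disease covary strongly,
# x-ray astronomy is independent of both in the Gaussian space.
mu = np.zeros(K)
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

theta_ctm = sample_logistic_normal(mu, sigma, 5000, rng)  # CTM-style proportions
theta_lda = rng.dirichlet(np.ones(K), size=5000)          # LDA-style proportions

# Empirical correlation between topic proportions (columns are topics).
corr_ctm = np.corrcoef(theta_ctm, rowvar=False)
corr_lda = np.corrcoef(theta_lda, rowvar=False)

print("genetics/disease correlation, logistic normal:", corr_ctm[0, 1])
print("genetics/disease correlation, Dirichlet:      ", corr_lda[0, 1])
```

Under these assumed parameters, the genetics/disease correlation under the logistic normal should come out higher than under the Dirichlet, where every pair of components is constrained to be negatively correlated.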

[1] M. C. Jones, et al. The Statistical Analysis of Compositional Data, 1986.

[2] S. Shen, et al. The statistical analysis of compositional data, 1983.

[3] John Aitchison, et al. The Statistical Analysis of Compositional Data, 1986.

[4] P. Donnelly, et al. Inference of population structure using multilocus genotype data, 2000, Genetics.

[5] David J. Spiegelhalter, et al. VIBES: A Variational Inference Engine for Bayesian Networks, 2002, NIPS.

[6] Elena A. Erosheva, et al. Grade of membership and latent structure models with application to disability survey data, 2002.

[7] Michael I. Jordan, et al. A generalized mean field algorithm for variational inference in exponential families, 2002, UAI.

[8] Ata Kabán, et al. Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles, 2003, NIPS.

[9] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[10] Thomas L. Griffiths, et al. Integrating Topics and Syntax, 2004, NIPS.

[11] Benjamin M. Marlin, et al. Collaborative Filtering: A Machine Learning Perspective, 2004.

[12] Thomas L. Griffiths, et al. The Author-Topic Model for Authors and Documents, 2004, UAI.

[13] Joseph Y. Halpern, et al. Proceedings of the 20th conference on Uncertainty in artificial intelligence, 2004, UAI 2004.

[14] Andrew McCallum, et al. The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email, 2005.

[15] Alexei A. Efros, et al. Discovering object categories in image collections, 2005.

[16] Terrence J. Sejnowski, et al. A Variational Principle for Graphical Models, 2007.