Accounting for burstiness in topic models

Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA). It is straightforward to incorporate the DCM extension into topic models that are more complex than LDA.
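
For reference, the Dirichlet compound multinomial (also called the multivariate Pólya distribution) mentioned in the abstract has the following standard form; the notation here (count vector x over words w, parameter vector α, document length n) is introduced for illustration and is not taken from the paper:

\[
p(\mathbf{x} \mid \boldsymbol{\alpha})
= \frac{n!}{\prod_{w} x_{w}!}\,
  \frac{\Gamma\!\bigl(\sum_{w}\alpha_{w}\bigr)}{\Gamma\!\bigl(n + \sum_{w}\alpha_{w}\bigr)}
  \prod_{w} \frac{\Gamma(x_{w} + \alpha_{w})}{\Gamma(\alpha_{w})},
\qquad n = \sum_{w} x_{w}.
\]

Under this distribution the predictive probability of word w after observing counts x is proportional to x_w + α_w, so each occurrence of a word makes further occurrences of that word more likely. This "rich get richer" behavior is the burstiness property the abstract refers to, and it is what a plain multinomial, whose per-word probability stays fixed, cannot express.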
