Accounting for burstiness in topic models

Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA). It is straightforward to incorporate the DCM extension into topic models that are more complex than LDA.
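
For reference, the Dirichlet compound multinomial (also called the multivariate Pólya distribution) mentioned in the abstract has the following standard form; the notation here (count vector x over words w, parameter vector α, document length n) is introduced for illustration and is not taken from the paper:

\[
p(\mathbf{x} \mid \boldsymbol{\alpha})
= \frac{n!}{\prod_{w} x_{w}!}\,
  \frac{\Gamma\!\bigl(\sum_{w}\alpha_{w}\bigr)}{\Gamma\!\bigl(n + \sum_{w}\alpha_{w}\bigr)}
  \prod_{w} \frac{\Gamma(x_{w} + \alpha_{w})}{\Gamma(\alpha_{w})},
\qquad n = \sum_{w} x_{w}.
\]

Under this distribution the predictive probability of word w after observing counts x is proportional to x_w + α_w, so each occurrence of a word makes further occurrences of that word more likely. This "rich get richer" behavior is the burstiness property the abstract refers to, and it is what a plain multinomial, whose per-word probability stays fixed, cannot express.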
