Pachinko allocation: DAG-structured mixture models of topic correlations

Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). PAM provides a flexible alternative to recent work by Blei and Lafferty (2006), which captures correlations only between pairs of topics. Using text data from newsgroups, historic NIPS proceedings and other research paper corpora, we show improved performance of PAM in document classification, likelihood of held-out data, the ability to support finer-grained topics, and topical keyword coherence.
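The DAG structure described above can be illustrated with a small generative sketch. This is a toy reading of the abstract, not the authors' implementation: it assumes every interior node carries a Dirichlet distribution over its children, that each document draws its own multinomial from every such Dirichlet, and that each word is produced by walking from the root to a leaf. The node names, vocabulary, Dirichlet parameters, and the helper functions `sample_dirichlet` and `generate_document` are all hypothetical.

```python
import random

# Toy DAG (hypothetical): each interior node maps to (children, Dirichlet alpha).
# Any name that is not a key of this dict is a leaf, i.e. a vocabulary word.
# "T_injury" has two parents, so the structure is a DAG rather than a tree.
DAG = {
    "root": (["T_sports", "T_medicine"], [1.0, 1.0]),
    "T_sports": (["T_injury", "game", "team"], [0.5, 1.0, 1.0]),
    "T_medicine": (["T_injury", "doctor", "drug"], [0.5, 1.0, 1.0]),
    "T_injury": (["knee", "surgery"], [1.0, 1.0]),
}


def sample_dirichlet(alpha, rng):
    """Draw one point from Dirichlet(alpha) by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(dag, n_words, rng):
    """Generate n_words words by sampling root-to-leaf paths through the DAG."""
    # Assumption: one multinomial per interior node, redrawn for each document.
    theta = {
        node: sample_dirichlet(alpha, rng)
        for node, (_children, alpha) in dag.items()
    }
    words = []
    for _ in range(n_words):
        node = "root"
        # Descend until the current node is a leaf (a word).
        while node in dag:
            children, _alpha = dag[node]
            node = rng.choices(children, weights=theta[node], k=1)[0]
        words.append(node)
    return words


rng = random.Random(0)
doc = generate_document(DAG, n_words=20, rng=rng)
print(" ".join(doc))
```

Because `T_injury` is reachable from both `T_sports` and `T_medicine`, words such as "knee" can co-occur with either parent's vocabulary, which is the kind of nested topic correlation the DAG is meant to express.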
