The generalized Dirichlet distribution in enhanced topic detection

We present a new, robust, and computationally efficient hierarchical Bayesian model for topic correlation modeling. We model the prior distribution over topics with a Generalized Dirichlet (GD) distribution rather than the Dirichlet distribution used in Latent Dirichlet Allocation (LDA); we call this model GD-LDA. The framework captures correlations between topics, as do the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM), while supporting faster inference than either. GD-LDA is effective at avoiding over-fitting as the number of topics increases. As a tree-structured model, it places the most important topics, by probability mass, in the upper part of the tree, so GD-LDA can select significant topics effectively. To discover topic relationships, we estimate hyper-parameters via Monte Carlo EM. We report Empirical Likelihood (EL) results on four public datasets from TREC and NIPS, then evaluate GD-LDA on ad hoc information retrieval (IR) using MAP, P@10, and Discounted Gain, and provide an empirical comparison of fitting times. GD-LDA significantly outperforms CTM, LDA, and PAM on EL, and it exceeds LDA, the dominant topic model in IR, on all IR measures. These improvements come with only a small increase in fitting time over LDA, in contrast to CTM and PAM.
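The GD prior can be understood through its stick-breaking construction (Connor and Mosimann, 1969): each component is an independent Beta fraction of the probability mass left after the preceding components, which is what lets the GD encode correlations that a standard Dirichlet cannot. The sketch below, a minimal illustration and not the paper's code, draws one topic-proportion vector this way; the parameter values are assumptions chosen only for demonstration.

```python
import numpy as np

def sample_generalized_dirichlet(a, b, rng):
    """Draw one probability vector of length K = len(a) + 1 from GD(a, b).

    Stick-breaking view: each Z_i ~ Beta(a_i, b_i) independently, and
    X_i = Z_i * prod_{j<i} (1 - Z_j) is Z_i's share of the remaining stick.
    The 2(K-1) free parameters (vs. K for a Dirichlet) allow correlated
    components.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    z = rng.beta(a, b)                           # K-1 independent Beta draws
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - z)))
    x = np.empty(len(a) + 1)
    x[:-1] = z * remaining[:-1]                  # X_i = Z_i * prod_{j<i}(1 - Z_j)
    x[-1] = remaining[-1]                        # leftover mass for the last topic
    return x

# Illustrative draw with made-up hyper-parameters for K = 4 topics.
rng = np.random.default_rng(0)
theta = sample_generalized_dirichlet([2.0, 1.5, 1.0], [3.0, 2.0, 1.0], rng)
assert np.isclose(theta.sum(), 1.0) and (theta >= 0).all()
```

The GD reduces to a standard Dirichlet in the special case b_i = a_{i+1} + b_{i+1}, which is why it strictly generalizes the LDA prior.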