Sparse stochastic inference for latent Dirichlet allocation

We present a hybrid algorithm for Bayesian topic models that combines the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. We used our algorithm to analyze a corpus of 1.2 million books (33 billion words) with thousands of topics. Our approach reduces the bias of variational inference and generalizes to many Bayesian hidden-variable models.

[1]  É. Moulines,et al.  Convergence of a stochastic approximation version of the EM algorithm , 1999 .

[2]  Masa-aki Sato,et al.  Online Model Selection Based on the Variational Bayes , 2001, Neural Computation.

[3]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[4]  E. Kuhn,et al.  Coupling a stochastic approximation version of EM with an MCMC procedure , 2004 .

[5]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.

[6]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Yee Whye Teh,et al.  A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation , 2006, NIPS.

[8]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[9]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[10]  Eric Moulines,et al.  On‐line expectation–maximization algorithm for latent data models , 2007, ArXiv.

[11]  Max Welling,et al.  Asynchronous Distributed Learning of Topic Models , 2008, NIPS.

[12]  Ruslan Salakhutdinov,et al.  Evaluation methods for topic models , 2009, ICML '09.

[13]  Wray L. Buntine Estimating Likelihoods for Topic Models , 2009, ACML.

[14]  Francis R. Bach,et al.  Online Learning for Latent Dirichlet Allocation , 2010, NIPS.

[15]  Timothy Baldwin,et al.  Automatic Evaluation of Topic Coherence , 2010, NAACL.

[16]  Erez Lieberman Aiden,et al.  Quantitative Analysis of Culture Using Millions of Digitized Books , 2010, Science.

[17]  Kristian Kersting,et al.  Larger Residuals, Less Work: Active Document Scheduling for Latent Dirichlet Allocation , 2011, ECML/PKDD.

[18]  Andrew McCallum,et al.  Optimizing Semantic Coherence in Topic Models , 2011, EMNLP.

[19]  Alexander J. Smola,et al.  Scalable inference in latent variable models , 2012, WSDM '12.