Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models

ABSTRACT Topic models, and more specifically the class of models based on latent Dirichlet allocation (LDA), are widely used for probabilistic modeling of text. Markov chain Monte Carlo (MCMC) sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler. Supplementary materials for this article are available online.
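
To make the partially collapsed construction concrete, below is a minimal sketch in Python/NumPy of the idea the abstract describes: the topic-word matrix Phi is kept in the sampler rather than integrated out, so the topic indicators of different documents become conditionally independent given Phi and can be sampled in parallel, after which Phi is redrawn from its Dirichlet full conditional. All function names, the symmetric priors alpha and beta, and the dense implementation are illustrative assumptions; the sketch omits the sparsity exploitation, the parallel execution machinery, and the two indicator-sampling strategies compared in the paper.

```python
# Minimal partially collapsed Gibbs sampler for LDA (theta collapsed, Phi kept).
# Illustrative sketch only; names and priors are assumptions, not the paper's code.
import numpy as np

def sample_document(doc_words, z, Phi, alpha, K, rng):
    """Resample topic indicators for one document given Phi (theta integrated out)."""
    n_dk = np.bincount(z, minlength=K).astype(float)   # topic counts in this document
    for i, w in enumerate(doc_words):
        n_dk[z[i]] -= 1.0                              # remove the current assignment
        p = Phi[:, w] * (n_dk + alpha)                 # p(z_i = k | rest), unnormalized
        p /= p.sum()
        z[i] = rng.choice(K, p=p)
        n_dk[z[i]] += 1.0
    return z

def sample_phi(docs, zs, K, V, beta, rng):
    """Draw each topic's word distribution from its Dirichlet full conditional."""
    n_kv = np.zeros((K, V))
    for doc_words, z in zip(docs, zs):
        np.add.at(n_kv, (z, doc_words), 1.0)           # topic-word assignment counts
    return np.array([rng.dirichlet(beta + n_kv[k]) for k in range(K)])

def pc_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    zs = [rng.integers(K, size=len(d)) for d in docs]  # random initial indicators
    Phi = sample_phi(docs, zs, K, V, beta, rng)
    for _ in range(iters):
        # Given Phi, the z-step is embarrassingly parallel over documents;
        # a real implementation would distribute this loop across workers.
        zs = [sample_document(np.asarray(d), z, Phi, alpha, K, rng)
              for d, z in zip(docs, zs)]
        Phi = sample_phi(docs, zs, K, V, beta, rng)
    return zs, Phi

# Toy usage: three tiny "documents" over a vocabulary of five word types.
if __name__ == "__main__":
    docs = [np.array([0, 1, 1, 2]), np.array([2, 3, 4]), np.array([0, 0, 4, 3])]
    zs, Phi = pc_gibbs(docs, K=2, V=5, iters=50)
    print(Phi.round(3))
```

The design point the abstract emphasizes is visible here: because Phi is not collapsed, no shared counts need to be updated while sampling indicators, which is what makes exact parallel sampling across documents possible.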
