Parallelizing LDA using Partially Collapsed Gibbs Sampling

Latent Dirichlet allocation (LDA) is a model widely used for unsupervised probabilistic modeling of text and images. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler that integrates out all model parameters except the topic indicators for each word. The topic indicators are Gibbs sampled iteratively by drawing each topic from its conditional posterior. The popularity of this sampler stems from its balanced combination of simplicity and efficiency, but its inherently sequential nature is an obstacle for parallel implementations. Growing corpus sizes and increasing model complexity are making inference in LDA models computationally infeasible without parallel sampling. We propose a parallel implementation of LDA that collapses only over the topic proportions in each document and therefore allows independent sampling of the topic indicators in different documents. We develop several modifications of the basic algorithm that exploit sparsity and structure to further improve the performance of the partially collapsed sampler. In contrast to other parallel LDA implementations, the partially collapsed sampler guarantees convergence to the true posterior. We show on several well-known corpora that the expected increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated for by the speed-up from parallelization for larger corpora.
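To make the sampling scheme concrete, the sketch below is a minimal NumPy illustration of a partially collapsed Gibbs sampler for LDA, not the authors' implementation: the document-level topic proportions are integrated out, the topic-word matrix phi is sampled explicitly, and, given phi, the topic indicators of different documents can be updated independently. Function names, hyperparameter defaults (alpha, beta), and the plain sequential document loop are illustrative assumptions.

```python
import numpy as np


def sample_document(words, z, phi, alpha, rng):
    """One Gibbs sweep over the topic indicators of a single document.

    The document's topic proportions are integrated out, so each indicator
    is drawn conditional on the remaining indicators in the same document
    and on the explicit topic-word matrix phi. Because phi is held fixed
    during the sweep, different documents can be processed in parallel.
    """
    K = phi.shape[0]
    n_dk = np.bincount(z, minlength=K).astype(float)   # topic counts in this document
    for i, w in enumerate(words):
        n_dk[z[i]] -= 1.0                              # remove the current assignment
        p = (n_dk + alpha) * phi[:, w]                 # partially collapsed conditional
        z[i] = rng.choice(K, p=p / p.sum())
        n_dk[z[i]] += 1.0
    return z


def sample_phi(docs, zs, V, K, beta, rng):
    """Draw each row of the topic-word matrix from its Dirichlet posterior."""
    counts = np.zeros((K, V))
    for words, z in zip(docs, zs):
        np.add.at(counts, (z, words), 1.0)
    g = rng.gamma(counts + beta)                       # normalized Gamma draws give a Dirichlet draw
    return g / g.sum(axis=1, keepdims=True)


def partially_collapsed_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of integer arrays, each holding the word ids of one document."""
    rng = np.random.default_rng(seed)
    docs = [np.asarray(d) for d in docs]
    zs = [rng.integers(K, size=len(d)) for d in docs]
    phi = sample_phi(docs, zs, V, K, beta, rng)
    for _ in range(iters):
        # Given phi, this loop is embarrassingly parallel over documents;
        # a real implementation would distribute it across workers.
        zs = [sample_document(d, z, phi, alpha, rng) for d, z in zip(docs, zs)]
        phi = sample_phi(docs, zs, V, K, beta, rng)
    return zs, phi
```

The key design point is visible in the two alternating steps: conditioning on phi decouples the documents (enabling parallelism), at the cost of the extra phi-sampling step that a fully collapsed sampler avoids.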
