Distributing the Stochastic Gradient Sampler for Large-Scale LDA

Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents. Recent work has focused on developing distributed algorithms in the batch setting, while leaving stochastic methods behind, even though such methods can effectively exploit statistical redundancy in big data and are therefore complementary to distributed computing. Distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling with distributed computing, but it suffers from drawbacks such as excessive communication and sensitivity to how the dataset is partitioned across nodes; DSGLD is typically limited to learning small models with about $10^3$ topics and a $10^3$ vocabulary size. In this paper, we present embarrassingly parallel SGLD (EPSGLD), a novel distributed stochastic gradient sampling method for topic models. Our sampler is built upon a divide-and-conquer architecture that produces robust and asymptotically exact samples with less communication overhead than DSGLD. We further propose several techniques to reduce the overhead in I/O and memory usage. Experiments on Wikipedia and ClueWeb12 documents demonstrate that EPSGLD can scale up to large models with $10^{10}$ parameters (i.e., $10^5$ topics and a $10^5$ vocabulary size), four orders of magnitude larger than DSGLD, and converges faster.
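The abstract rests on two building blocks: the stochastic gradient Langevin dynamics (SGLD) update applied to mini-batches, and a divide-and-conquer scheme in which each worker samples locally from its own document shard before the draws are combined. The Python sketch below illustrates only this general pattern; the function names, the generic gradient interface, and the flat-average combination rule are illustrative assumptions and do not reproduce the paper's EPSGLD implementation, which operates on simplex-constrained topic-word parameters and uses its own aggregation and I/O optimizations.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_batch, n_total, n_batch, step, rng):
    """One SGLD update: theta' = theta + (step/2) * (grad log prior
    + (N/n) * mini-batch likelihood gradient) + Gaussian noise of variance `step`."""
    noise = rng.normal(scale=np.sqrt(step), size=theta.shape)
    scaled_grad = grad_log_prior + (n_total / n_batch) * grad_log_lik_batch
    return theta + 0.5 * step * scaled_grad + noise

def run_local_chain(shard, theta0, grad_log_prior, grad_log_lik, n_iters, batch_size, step, seed):
    """Run SGLD on one worker's document shard and collect its samples.
    `grad_log_lik(batch, theta)` is a placeholder for the model-specific gradient;
    for LDA it would act on topic-word parameters constrained to the simplex."""
    rng = np.random.default_rng(seed)
    theta, samples = theta0.copy(), []
    for _ in range(n_iters):
        idx = rng.choice(len(shard), size=batch_size, replace=False)
        batch = [shard[i] for i in idx]
        theta = sgld_step(theta, grad_log_prior(theta), grad_log_lik(batch, theta),
                          n_total=len(shard), n_batch=batch_size, step=step, rng=rng)
        samples.append(theta.copy())
    return samples

def combine(per_worker_samples):
    """Naive combination: average aligned draws across workers.
    (EPSGLD's actual aggregation of local samples is more involved; this
    stand-in only marks where the divide-and-conquer combination happens.)"""
    return [np.mean(draws, axis=0) for draws in zip(*per_worker_samples)]
```

In this pattern, communication is needed only when local samples are combined, which is why a divide-and-conquer sampler can incur far less communication overhead than a scheme that synchronizes parameters every few updates.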
