Stochastic Gradient MCMC: Algorithms and Applications

Despite the advantages of Bayesian inference, such as uncertainty quantification, accurate averaged predictions, and resistance to overfitting, traditional Markov chain Monte Carlo (MCMC) methods have been regarded as unsuitable for large-scale problems because they require processing the entire dataset at every iteration, rather than a small random minibatch as in stochastic gradient optimization. The first attempt at a scalable MCMC method based on stochastic gradients was stochastic gradient Langevin dynamics (SGLD), proposed by Welling and Teh [2011]. Derived from the Langevin Monte Carlo method, SGLD achieves O(n) computation per iteration (where n is the minibatch size) by using stochastic gradients estimated from minibatches and by skipping the Metropolis-Hastings accept-reject test.

In this thesis, we introduce recent advances in stochastic gradient MCMC (SG-MCMC) since the advent of SGLD. Our contributions are twofold: algorithms and applications. In the algorithm part, we first propose the stochastic gradient Fisher scoring (SGFS) algorithm, which resolves two drawbacks of SGLD: its poor mixing rate and the arbitrarily large bias that arises when large step sizes are used. We then propose the distributed SGLD (D-SGLD) algorithm, which extends stochastic gradient MCMC to distributed computing systems. In the application part, we apply the developed SG-MCMC algorithms to three of the most popular large-scale problems: topic modeling with the latent Dirichlet allocation model, recommender systems based on matrix factorization, and community modeling in social networks with mixed membership stochastic blockmodels. Each application raises unique challenges that make it difficult to apply existing SG-MCMC methods directly; by resolving them, we obtain state-of-the-art results that outperform existing approaches based on collapsed Gibbs sampling, stochastic variational inference, or distributed stochastic gradient descent.
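
The SGLD update described in the abstract can be sketched in a few lines. The following is a minimal illustration and not the thesis implementation: the toy model (x_i ~ N(theta, 1) with a N(0, prior_var) prior on theta), the function name `sgld_gaussian_mean`, and all default parameter values are assumptions chosen for the example. It also uses a constant step size for simplicity, whereas Welling and Teh [2011] analyze a decreasing step-size schedule. Each iteration touches only a minibatch, rescales its gradient by N/n, adds Gaussian noise with variance equal to the step size, and performs no accept-reject test.

```python
import random


def sgld_gaussian_mean(data, batch_size=50, step_size=1e-4, num_steps=5000,
                       prior_var=100.0, seed=0):
    """Draw approximate posterior samples of a Gaussian mean with SGLD.

    Hypothetical toy model (an assumption for illustration):
    x_i ~ N(theta, 1), with prior theta ~ N(0, prior_var).
    """
    rng = random.Random(seed)
    total = len(data)
    theta = 0.0
    samples = []
    for _ in range(num_steps):
        # Only a minibatch is visited, so one step costs O(batch_size).
        batch = rng.sample(data, batch_size)
        # Gradient of the log prior at theta.
        grad_prior = -theta / prior_var
        # Minibatch gradient of the log likelihood, rescaled by N / n so it
        # is an unbiased estimate of the full-data gradient.
        grad_lik = (total / batch_size) * sum(x - theta for x in batch)
        # Injected Gaussian noise with variance equal to the step size.
        noise = rng.gauss(0.0, step_size ** 0.5)
        # Langevin step; no Metropolis-Hastings accept-reject test follows.
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples


# Usage: synthetic data with true mean 2.0, discarding early samples as burn-in.
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
samples = sgld_gaussian_mean(data)
kept = samples[1000:]
posterior_mean_estimate = sum(kept) / len(kept)
```

For this conjugate toy model the exact posterior mean is sum(x) / (N + 1/prior_var), so the average of the retained samples can be checked against it. With a constant step size the samples come from an approximation to the posterior, which is the bias that the abstract says grows with large step sizes.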

[1]  Chong Wang,et al.  Stochastic variational inference , 2012, J. Mach. Learn. Res..

[2]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3]  Alexander J. Smola,et al.  An architecture for parallel topic models , 2010, Proc. VLDB Endow..

[4]  Arnaud Doucet,et al.  Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach , 2014, ICML.

[5]  Ahn,et al.  Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring , 2012 .

[6]  Anoop Korattikara Balan Approximate Markov Chain Monte Carlo Algorithms for Large Scale Bayesian Inference , 2014 .

[7]  Ryan Babbush,et al.  Bayesian Sampling Using Stochastic Gradient Thermostats , 2014, NIPS.

[8]  Alexander J. Smola,et al.  Parallelized Stochastic Gradient Descent , 2010, NIPS.

[9]  M. Seeger Low Rank Updates for the Cholesky Decomposition , 2004 .

[10]  Alan Edelman,et al.  Julia: A Fresh Approach to Numerical Computing , 2014, SIAM Rev..

[11]  H. Robbins A Stochastic Approximation Method , 1951 .

[12]  Rainer Gemulla,et al.  Distributed Matrix Completion , 2012, 2012 IEEE 12th International Conference on Data Mining.

[13]  Léon Bottou,et al.  The Tradeoffs of Large Scale Learning , 2007, NIPS.

[14]  Yehuda Koren,et al.  Matrix Factorization Techniques for Recommender Systems , 2009, Computer.

[15]  Radford M. Neal,et al.  High Dimensional Classification with Bayesian Neural Networks and Dirichlet Diffusion Trees , 2006, Feature Extraction.

[16]  M. Girolami,et al.  Riemann manifold Langevin and Hamiltonian Monte Carlo methods , 2011, Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[17]  Babak Shahbaba,et al.  Distributed Stochastic Gradient MCMC , 2014, ICML.

[18]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[19]  Seunghak Lee,et al.  More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server , 2013, NIPS.

[20]  Simon Günter,et al.  A Stochastic Quasi-Newton Method for Online Convex Optimization , 2007, AISTATS.

[21]  Radford M. Neal Probabilistic Inference Using Markov Chain Monte Carlo Methods , 2011 .

[22]  Tianqi Chen,et al.  Stochastic Gradient Hamiltonian Monte Carlo , 2014, ICML.

[23]  Francis R. Bach,et al.  Online Learning for Latent Dirichlet Allocation , 2010, NIPS.

[24]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008 .

[25]  Yee Whye Teh,et al.  Bayesian Learning via Stochastic Gradient Langevin Dynamics , 2011, ICML.

[26]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[27]  Edoardo M. Airoldi,et al.  Mixed Membership Stochastic Blockmodels , 2007, NIPS.

[28]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[29]  Yee Whye Teh,et al.  Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex , 2013, NIPS.

[30]  Christopher Ré,et al.  Parallel stochastic gradient algorithms for large-scale matrix completion , 2013, Mathematical Programming Computation.

[31]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[32]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[33]  Gideon S. Mann,et al.  MapReduce/Bigtable for Distributed Optimization , 2010 .

[34]  Alexander J. Smola,et al.  Scalable inference in latent variable models , 2012, WSDM '12.

[35]  Kathryn B. Laskey,et al.  Population Markov Chain Monte Carlo , 2004, Machine Learning.

[36]  L. L. Cam,et al.  Asymptotic Methods In Statistical Decision Theory , 1986 .

[37]  Masashi Sugiyama,et al.  Bayesian Dark Knowledge , 2015 .

[38]  W. A. Scott Maximum likelihood estimation using the empirical fisher information matrix , 2002 .

[39]  Michael W Deem,et al.  Parallel tempering: theory, applications, and new perspectives. , 2005, Physical chemistry chemical physics : PCCP.

[40]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Ryan P. Adams,et al.  Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes , 2010, UAI.

[42]  Peter J. Haas,et al.  Large-scale matrix factorization with distributed stochastic gradient descent , 2011, KDD.

[43]  Ruslan Salakhutdinov,et al.  Bayesian probabilistic matrix factorization using Markov chain Monte Carlo , 2008, ICML '08.

[44]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[45]  Hiroshi Nakagawa,et al.  Approximation Analysis of Stochastic Gradient Langevin Dynamics by using Fokker-Planck Equation and Ito Process , 2014, ICML.

[46]  Marimuthu Palaniswami,et al.  Internet of Things (IoT): A vision, architectural elements, and future directions , 2012, Future Gener. Comput. Syst..

[47]  Ruslan Salakhutdinov,et al.  Probabilistic Matrix Factorization , 2007, NIPS.

[48]  M. Girolami Riemann Manifold Langevin and Hamiltonian Monte Carlo , 2010 .

[49]  James Bennett,et al.  The Netflix Prize , 2007 .

[50]  Gideon S. Mann,et al.  Distributed Training Strategies for the Structured Perceptron , 2010, NAACL.

[51]  Yehuda Koren,et al.  The Yahoo! Music Dataset and KDD-Cup '11 , 2012, KDD Cup.

[52]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[53]  V. Borkar Stochastic approximation with two time scales , 1997 .

[54]  Darren J. Wilkinson,et al.  Parallel Bayesian Computation , 2005 .

[55]  Gideon S. Mann,et al.  Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models , 2009, NIPS.

[56]  Yoshua Bengio,et al.  Classification using discriminative restricted Boltzmann machines , 2008, ICML '08.

[57]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.

[58]  Max Welling,et al.  Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures , 2010, AAAI.

[59]  Max Welling,et al.  Distributed and Adaptive Darting Monte Carlo through Regenerations , 2013, AISTATS.

[60]  Michael J. Freedman,et al.  Scalable Inference of Overlapping Communities , 2012, NIPS.

[61]  Max Welling,et al.  Distributed Inference for Latent Dirichlet Allocation , 2007, NIPS.

[62]  Christophe Andrieu,et al.  A tutorial on adaptive MCMC , 2008, Stat. Comput..

[63]  Max Welling,et al.  Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget , 2013, ICML 2014.

[64]  Chih-Jen Lin,et al.  A fast parallel SGD for matrix factorization in shared memory systems , 2013, RecSys.