AMAGOLD: Amortized Metropolis Adjustment for Efficient Stochastic Gradient MCMC

Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is an efficient method for sampling from continuous distributions. It is a faster alternative to HMC: instead of using the whole dataset at each iteration, SGHMC uses only a subsample. This improves performance, but introduces bias that can cause SGHMC to converge to the wrong distribution. One can prevent this using a step size that decays to zero, but such a step size schedule can drastically slow down convergence. To address this tension, we propose a novel second-order SG-MCMC algorithm---AMAGOLD---that infrequently uses Metropolis-Hastings (M-H) corrections to remove bias. The infrequency of corrections amortizes their cost. We prove AMAGOLD converges to the target distribution with a fixed, rather than a diminishing, step size, and that its convergence rate is at most a constant factor slower than a full-batch baseline. We empirically demonstrate AMAGOLD's effectiveness on synthetic distributions, Bayesian logistic regression, and Bayesian neural networks.
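The core idea in the abstract, running many cheap minibatch-gradient steps and only occasionally paying for a full-batch Metropolis-Hastings test, can be illustrated with a minimal sketch. This is a simplified toy (a made-up Gaussian-mean posterior, not the paper's experiments, and plain stochastic-gradient leapfrog rather than AMAGOLD's skew-reversible second-order dynamic): each trajectory of T minibatch leapfrog steps gets a single full-batch M-H correction, amortizing the correction's cost over T steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (hypothetical, not the paper's setup): infer the mean `theta`
# of Gaussian data under an N(0, 1) prior with unit-variance likelihood,
# so the exact posterior mean data.sum() / (N + 1) is known for checking.
data = rng.normal(1.0, 1.0, size=100)
N, batch = len(data), 50

def full_U(theta):
    """Full-batch negative log posterior (used only for the M-H test)."""
    return 0.5 * theta ** 2 + 0.5 * np.sum((data - theta) ** 2)

def stoch_grad_U(theta):
    """Unbiased minibatch estimate of dU/dtheta."""
    idx = rng.integers(0, N, size=batch)
    return theta - (N / batch) * np.sum(data[idx] - theta)

def amortized_mh_sghmc(n_outer=3000, T=10, eps=0.05):
    """Run minibatch leapfrog trajectories with one full-batch M-H test
    per trajectory -- a sketch of the amortized-correction idea, not the
    AMAGOLD algorithm itself (whose reversible dynamic makes the test exact)."""
    theta, samples = 0.0, []
    for _ in range(n_outer):
        v = rng.normal()                      # resample momentum
        theta0 = theta
        H0 = full_U(theta) + 0.5 * v ** 2     # full-batch energy at start
        for _ in range(T):                    # T cheap minibatch leapfrog steps
            v -= 0.5 * eps * stoch_grad_U(theta)
            theta += eps * v
            v -= 0.5 * eps * stoch_grad_U(theta)
        H1 = full_U(theta) + 0.5 * v ** 2     # full-batch energy at end
        if rng.random() >= np.exp(min(0.0, H0 - H1)):
            theta = theta0                    # reject: revert the whole trajectory
        samples.append(theta)
    return np.array(samples)

samples = amortized_mh_sghmc()
post_mean = data.sum() / (N + 1)              # exact posterior mean for this toy
print(samples[1000:].mean())
```

Only two full-batch energy evaluations are paid per T-step trajectory, so the per-step overhead of the correction shrinks as T grows; this is the amortization the abstract refers to, here in deliberately simplified form.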
