Efficient and generalizable tuning strategies for stochastic gradient MCMC

Stochastic gradient Markov chain Monte Carlo (SGMCMC) is a popular class of algorithms for scalable Bayesian inference. However, these algorithms include hyperparameters, such as step size or batch size, that influence the accuracy of estimators based on the obtained posterior samples. These hyperparameters must therefore be tuned by the practitioner, and currently no principled, automated way to tune them exists. Standard MCMC tuning methods based on acceptance rates cannot be used for SGMCMC, so alternative tools and diagnostics are required. We propose a novel bandit-based algorithm that tunes the SGMCMC hyperparameters by minimizing the Stein discrepancy between the true posterior and its Monte Carlo approximation. We provide theoretical results supporting this approach and assess various Stein-based discrepancies. We support our results with experiments on both simulated and real datasets, and find that this method is practical for a wide range of applications.
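To make the idea of bandit-based tuning against a Stein discrepancy concrete, the following is a minimal sketch rather than the authors' algorithm. It rests on several assumptions that the abstract does not specify: the sampler is plain stochastic gradient Langevin dynamics (SGLD) on a toy Gaussian-mean model, the discrepancy is a kernel Stein discrepancy (KSD) with an inverse multiquadric kernel evaluated with full-data gradients, and the bandit strategy is successive halving over a grid of candidate step sizes. The function names `imq_ksd`, `full_score`, `sgld_run`, and `tune_step_size` are hypothetical.

```python
import numpy as np

def imq_ksd(x, score, c=1.0, beta=-0.5):
    """V-statistic KSD estimate with kernel k(x, y) = (c^2 + |x - y|^2)^beta.
    x and score are (n, d) arrays; score[i] is grad log p at x[i]."""
    n, d = x.shape
    r = x[:, None, :] - x[None, :, :]            # pairwise differences, (n, n, d)
    sq = np.sum(r ** 2, axis=-1)
    u = c ** 2 + sq
    k = u ** beta
    ds = score[:, None, :] - score[None, :, :]
    cross = -2 * beta * u ** (beta - 1) * np.sum(ds * r, axis=-1)
    trace = (-2 * beta * d * u ** (beta - 1)
             - 4 * beta * (beta - 1) * u ** (beta - 2) * sq)
    v = np.mean((score @ score.T) * k + cross + trace)
    if not np.isfinite(v):                       # diverged chain: worst possible value
        return np.inf
    return float(np.sqrt(max(v, 0.0)))

def full_score(theta, data):
    """Full-data grad log posterior for y_i ~ N(theta, I), prior theta ~ N(0, 100 I)."""
    return -theta / 100.0 + data.sum(axis=0) - len(data) * theta

def sgld_run(theta, step, n_iter, data, batch, rng):
    """Run n_iter SGLD updates from theta; return final state and all samples."""
    n_data = len(data)
    out = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        idx = rng.choice(n_data, size=batch, replace=False)
        grad = -theta / 100.0 + (n_data / batch) * np.sum(data[idx] - theta, axis=0)
        theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(theta.size)
        out[t] = theta
    return theta, out

def tune_step_size(steps, data, n_iter=200, batch=10, seed=0):
    """Successive halving: run every surviving step size, drop the worse half by KSD."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    arms = {h: (data.mean(axis=0).copy(), np.empty((0, d))) for h in steps}
    while len(arms) > 1:
        ksd = {}
        for h, (theta, chain) in list(arms.items()):
            theta, new = sgld_run(theta, h, n_iter, data, batch, rng)
            chain = np.vstack([chain, new])
            arms[h] = (theta, chain)
            ksd[h] = imq_ksd(chain, full_score(chain, data))
        keep = sorted(ksd, key=ksd.get)[:max(1, len(arms) // 2)]
        arms = {h: arms[h] for h in keep}
        n_iter *= 2                              # survivors get a larger budget
    return next(iter(arms))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = 1.0 + rng.standard_normal((100, 1))
    print(tune_step_size([1e-5, 1e-4, 1e-3, 1e-2], data))
```

Because an acceptance rate is unavailable for SGMCMC, the sketch ranks candidates only by how well their samples match the target under the KSD. Doubling the per-round budget for the survivors is one common successive-halving convention and is a design choice of this sketch, not something stated in the abstract.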
