Stochastic Gradient Langevin Dynamics Algorithms with Adaptive Drifts

Bayesian deep learning offers a principled way to address many issues concerning the safety of artificial intelligence (AI), such as model uncertainty, model interpretability, and prediction bias. However, due to the lack of efficient Monte Carlo algorithms for sampling from the posterior of deep neural networks (DNNs), Bayesian deep learning has not yet been widely adopted in AI systems. We propose a class of adaptive stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms in which the drift function is biased to enhance escape from saddle points, and the bias is adaptively adjusted according to the gradients of past samples. We establish the convergence of the proposed algorithms under mild conditions and demonstrate, via numerical examples, that they can significantly outperform existing SGMCMC algorithms such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian Monte Carlo (SGHMC), and preconditioned SGLD, in both simulation and optimization tasks.
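
As a rough illustration of the kind of update the abstract describes, the sketch below combines a plain SGLD step with an Adam-style bias term built from moving averages of past stochastic gradients, so that flat (saddle-like) directions receive a larger push. The specific moment estimates, the `bias_scale` knob, and all function names are assumptions chosen for illustration, not the paper's exact algorithm.

```python
import numpy as np

def adaptive_drift_sgld_step(theta, grad_log_post, step_size, state,
                             beta1=0.9, beta2=0.999, bias_scale=1.0, eps=1e-8):
    """One adaptive-drift SGLD update (illustrative sketch only).

    `grad_log_post` returns a stochastic estimate of the gradient of the
    log-posterior at `theta`.  The Adam-style moments `m`, `v` and the
    `bias_scale` parameter are assumptions used here for illustration.
    """
    g = grad_log_post(theta)

    # Exponential moving averages of past stochastic gradients.
    m = beta1 * state.get("m", np.zeros_like(theta)) + (1 - beta1) * g
    v = beta2 * state.get("v", np.zeros_like(theta)) + (1 - beta2) * g ** 2
    state["m"], state["v"] = m, v

    # Plain SGLD drift (gradient of the log-posterior) plus an adaptive
    # bias assembled from the past-gradient moments.
    drift = g + bias_scale * m / (np.sqrt(v) + eps)

    # Injected Gaussian noise, scaled as in standard SGLD.
    noise = np.sqrt(2.0 * step_size) * np.random.randn(*theta.shape)
    return theta + step_size * drift + noise, state
```

A sampler would call this step repeatedly, typically with a decreasing step-size schedule as in standard SGLD, and collect the iterates as (approximate) posterior samples.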
