Revisiting the Effects of Stochasticity for Hamiltonian Samplers

We revisit the theoretical properties of Hamiltonian stochastic differential equations (SDEs) for Bayesian posterior sampling, and we study the two types of errors that arise from numerical SDE simulation: the discretization error and the error due to noisy gradient estimates in the context of data subsampling. Our main result is a novel analysis of the effect of mini-batches through the lens of differential operator splitting, revising previous results in the literature. The stochastic component of a Hamiltonian SDE is decoupled from the gradient noise, for which we make no normality assumptions. This leads to the identification of a convergence bottleneck: when considering mini-batches, the best achievable error rate is O(η), where η is the integrator step size. Our theoretical results are supported by an empirical study on a variety of regression and classification tasks for Bayesian neural networks.
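To make the setting concrete, below is a minimal sketch (Python/NumPy) of a stochastic-gradient Hamiltonian sampler of the kind the abstract refers to: a symplectic-Euler discretization with step size η, where the injected Brownian noise is decoupled from the (possibly non-Gaussian) minibatch gradient noise. The toy Gaussian model, variable names, and constants are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate-Gaussian posterior: standard-normal prior on theta,
# Gaussian likelihood with known variance 4, so the exact gradient is known.
n_data, batch_size = 1000, 50
data = rng.normal(loc=1.0, scale=2.0, size=n_data)

def minibatch_grad_log_post(theta):
    """Unbiased minibatch estimate of the gradient of the log posterior."""
    idx = rng.choice(n_data, size=batch_size, replace=False)
    grad_lik = np.sum(data[idx] - theta) / 4.0        # likelihood term
    return -theta + (n_data / batch_size) * grad_lik  # prior + rescaled batch

eta = 1e-3       # integrator step size (the eta of the error bound)
friction = 1.0   # friction / dissipation coefficient
theta, r = 0.0, 0.0
samples = []

for _ in range(20000):
    # Symplectic-Euler style update: the sqrt(2*friction*eta) noise is the
    # stochastic component of the SDE, separate from the gradient noise.
    r += (eta * minibatch_grad_log_post(theta)
          - eta * friction * r
          + np.sqrt(2.0 * friction * eta) * rng.normal())
    theta += eta * r
    samples.append(theta)

print("posterior mean estimate:", np.mean(samples[5000:]))
```

Running the sketch, the estimated posterior mean should land near the analytical value of roughly 1.0 for this toy model; shrinking η reduces the discretization error, while the minibatch gradient noise limits the achievable accuracy, in line with the O(η) bottleneck discussed above.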
