Partitioned integrators for thermodynamic parameterization of neural networks

Stochastic Gradient Langevin Dynamics, the "unadjusted Langevin algorithm", and Adaptive Langevin Dynamics (also known as Stochastic Gradient Nosé-Hoover dynamics) are examples of existing thermodynamic parameterization methods in use for machine learning, but these can be substantially improved. We find that by partitioning the parameters based on natural layer structure we obtain schemes with rapid convergence for data sets with complicated loss landscapes. We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including LaLa (a multi-layer Langevin algorithm), AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and Overdamped Langevin); we examine the convergence of these methods in numerical studies and compare their performance with each other and with standard alternatives such as stochastic gradient descent and ADAM. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms incorporated into machine learning frameworks, in particular for data sets with complicated loss landscapes. Moreover, we show in numerical studies that sampling-based methods excite many degrees of freedom. The equipartition property, a consequence of their ergodicity, means that these methods keep an ensemble of low-loss states in play during the training process. We show that, by drawing parameter states from a sufficiently rich distribution of nearby candidate states, the thermodynamic schemes produce smoother classifiers, improve generalization and reduce overfitting compared to traditional optimizers.
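To make the layer-partitioned idea concrete, the following is a minimal sketch (not the authors' reference implementation) of a LOL-style step for a two-layer feed-forward network: underdamped Langevin dynamics on the hidden-layer weights and overdamped Langevin (SGLD-like) updates on the output layer. The network architecture, the splitting used for the Langevin part, and the hyperparameters h, gamma, tau1, tau2 are illustrative assumptions, not values or code from the paper.

```python
import math
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Small feed-forward classifier/regressor used only for illustration."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.hidden = nn.Linear(d_in, d_hidden)
        self.output = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x)))

def lol_step(model, momenta, h=0.05, gamma=10.0, tau1=1e-4, tau2=1e-5):
    """One partitioned update; gradients must already be populated via backward()."""
    with torch.no_grad():
        # Underdamped Langevin on the hidden layer (simple B-A-O splitting).
        c = math.exp(-gamma * h)
        for p in model.hidden.parameters():
            v = momenta[p]
            v -= h * p.grad                      # B: kick by stochastic gradient force
            p += h * v                           # A: drift of the parameters
            # O: Ornstein-Uhlenbeck thermostat step at temperature tau1
            v.mul_(c).add_(math.sqrt((1.0 - c**2) * tau1) * torch.randn_like(p))

        # Overdamped Langevin (SGLD-like) on the output layer at temperature tau2.
        for p in model.output.parameters():
            p -= h * p.grad
            p += math.sqrt(2.0 * h * tau2) * torch.randn_like(p)

# Usage: compute the minibatch loss, back-propagate, then apply the partitioned step.
model = TwoLayerNet(2, 20, 1)
momenta = {p: torch.zeros_like(p) for p in model.hidden.parameters()}
x, y = torch.randn(32, 2), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
model.zero_grad()
loss.backward()
lol_step(model, momenta)
```

The point of the sketch is only the partition: each parameter group is driven by its own stochastic dynamics, with separate friction and temperature, rather than a single global update rule applied to all layers.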
