On the interplay between noise and curvature and its effect on optimization and generalization

The speed at which one can minimize an expected loss using stochastic methods depends on two properties: the curvature of the loss and the variance of the gradients. While most previous works focus on one or the other of these properties, we explore how their interaction affects optimization speed. Further, as the ultimate goal is good generalization performance, we clarify how both curvature and noise are relevant to properly estimating the generalization gap. Realizing that the limitations of some existing works stem from a confusion between the Fisher matrix, the Hessian, and the covariance matrix of the gradients, we also clarify the distinction between these three matrices.
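
For concreteness, here is one standard way to write the three matrices in question for a per-example loss $\ell(\theta; x, y) = -\log p_\theta(y \mid x)$; the notation below is ours, not the paper's:

\[
H(\theta) \;=\; \mathbb{E}_{(x,y)\sim p}\!\left[\nabla^2_\theta\, \ell(\theta; x, y)\right],
\qquad
\Sigma(\theta) \;=\; \mathbb{E}_{(x,y)\sim p}\!\left[g\, g^{\top}\right] - \mathbb{E}_{(x,y)\sim p}[g]\,\mathbb{E}_{(x,y)\sim p}[g]^{\top},
\quad g = \nabla_\theta\, \ell(\theta; x, y),
\]
\[
F(\theta) \;=\; \mathbb{E}_{x\sim p,\; \hat y \sim p_\theta(\cdot \mid x)}\!\left[\nabla_\theta \log p_\theta(\hat y \mid x)\, \nabla_\theta \log p_\theta(\hat y \mid x)^{\top}\right].
\]

Here $H$ is the Hessian of the expected loss (curvature), $\Sigma$ is the covariance of the per-example gradients (noise), and $F$ is the Fisher information matrix, whose labels are sampled from the model rather than taken from the data; replacing $\hat y$ by the observed label $y$ yields the so-called empirical Fisher. Roughly speaking, these quantities coincide only when the model distribution matches the data distribution and the mean gradient vanishes; in general they differ, which is the distinction the abstract refers to.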
