SGD SMOOTHS THE SHARPEST DIRECTIONS

Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parametrized models such as deep neural networks. We observe that the noise in SGD controls the spectral norm and conditioning of the Hessian throughout training. We hypothesize that this phenomenon arises from the dynamics of neurons saturating their non-linearity along the largest-curvature directions, which improves conditioning.
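Tracking the Hessian's spectral norm during training requires its largest eigenvalue, which is feasible without materializing the full Hessian by using Hessian-vector products. The sketch below illustrates one standard way to do this, power iteration in JAX; it is a minimal illustration, not the authors' code, and the toy tanh model, the synthetic data, and all function names are assumptions made for the example.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Hypothetical toy model: scalar output tanh(x @ w) with squared error.
    return jnp.mean((jnp.tanh(x @ w) - y) ** 2)

def hvp(w, x, y, v):
    # Hessian-vector product H v without forming H, via
    # forward-over-reverse autodiff of the gradient.
    return jax.jvp(jax.grad(lambda p: loss(p, x, y)), (w,), (v,))[1]

def top_eigenvalue(w, x, y, iters=100, seed=0):
    # Power iteration: v <- Hv / ||Hv||. The Rayleigh quotient v^T H v
    # converges to the eigenvalue of largest magnitude, which equals the
    # spectral norm when that eigenvalue is positive.
    v = jax.random.normal(jax.random.PRNGKey(seed), w.shape)
    v = v / jnp.linalg.norm(v)
    eig = 0.0
    for _ in range(iters):
        hv = hvp(w, x, y, v)
        eig = jnp.vdot(v, hv)
        v = hv / (jnp.linalg.norm(hv) + 1e-12)
    return eig

# Usage on synthetic data and random weights.
x = jax.random.normal(jax.random.PRNGKey(1), (32, 10))
y = jnp.zeros((32,))
w = jax.random.normal(jax.random.PRNGKey(2), (10,))
print(top_eigenvalue(w, x, y))
```

Power iteration is the simplest such estimator; Lanczos iteration is a common alternative that recovers several extreme eigenvalues from the same Hessian-vector products, at the cost of a slightly more involved recurrence.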
