Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

The early phase of training has been shown to be important for deep neural networks in two ways. First, the degree of regularization in this phase significantly impacts final generalization. Second, it is accompanied by a rapid change in the local loss curvature that is influenced by regularization choices. Connecting these two findings, we show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the beginning of training. We argue that this is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We further show that the early value of the trace of the FIM correlates strongly with final generalization. We highlight that, in the absence of implicit or explicit regularization, the trace of the FIM can grow to a large value early in training, a phenomenon we refer to as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that 1) it limits memorization by reducing the learning speed of examples with noisy labels more than that of clean examples, and 2) trajectories with a low initial trace of the FIM end in flat minima, which are commonly associated with good generalization.
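To make the explicit penalty concrete, below is a minimal sketch of how a trace-of-FIM penalty could be added to a standard training step. It assumes a PyTorch classifier with a cross-entropy loss; the estimator used here (the squared norm of the mini-batch gradient of the log-likelihood of labels sampled from the model's own predictive distribution) and the penalty weight `fisher_coeff` are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: adding a Fisher-trace penalty to the training loss.
# Assumptions: `model` is a PyTorch classifier returning logits;
# the penalty is estimated on the current mini-batch.

import torch
import torch.nn.functional as F


def fisher_trace_penalty(model, inputs):
    """Mini-batch estimate related to Tr(FIM): squared norm of the gradient
    of log p(y_hat | x), with y_hat sampled from the model's predictions."""
    logits = model(inputs)
    with torch.no_grad():
        probs = F.softmax(logits, dim=1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(1)
    log_likelihood = -F.cross_entropy(logits, sampled)
    grads = torch.autograd.grad(
        log_likelihood, list(model.parameters()), create_graph=True
    )
    return sum(g.pow(2).sum() for g in grads)


def training_step(model, optimizer, inputs, targets, fisher_coeff=0.1):
    """One SGD step on cross-entropy plus the Fisher-trace penalty."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss = loss + fisher_coeff * fisher_trace_penalty(model, inputs)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch `create_graph=True` lets the penalty itself be differentiated during `loss.backward()`, so the regularizer actually shapes the update rather than only being measured.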
