Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

The early phase of training has been shown to be important for deep neural networks in two ways. First, the degree of regularization in this phase significantly impacts final generalization. Second, it is accompanied by a rapid change in the local loss curvature that is influenced by regularization choices. Connecting these two findings, we show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the beginning of training. We argue that this is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We further show that the early value of the trace of the FIM correlates strongly with final generalization. We highlight that, in the absence of implicit or explicit regularization, the trace of the FIM can grow to a large value early in training, a phenomenon we refer to as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that 1) it limits memorization by reducing the learning speed of examples with noisy labels more than that of clean examples, and 2) trajectories with a low initial trace of the FIM end in flat minima, which are commonly associated with good generalization.
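To make the explicit penalty concrete, below is a minimal sketch of how a trace-of-FIM penalty could be added to a standard training step. It assumes a PyTorch classifier with a cross-entropy loss; the estimator used here (the squared norm of the mini-batch gradient of the log-likelihood of labels sampled from the model's own predictive distribution) and the penalty weight `fisher_coeff` are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: adding a Fisher-trace penalty to the training loss.
# Assumptions: `model` is a PyTorch classifier returning logits;
# the penalty is estimated on the current mini-batch.

import torch
import torch.nn.functional as F


def fisher_trace_penalty(model, inputs):
    """Mini-batch estimate related to Tr(FIM): squared norm of the gradient
    of log p(y_hat | x), with y_hat sampled from the model's predictions."""
    logits = model(inputs)
    with torch.no_grad():
        probs = F.softmax(logits, dim=1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(1)
    log_likelihood = -F.cross_entropy(logits, sampled)
    grads = torch.autograd.grad(
        log_likelihood, list(model.parameters()), create_graph=True
    )
    return sum(g.pow(2).sum() for g in grads)


def training_step(model, optimizer, inputs, targets, fisher_coeff=0.1):
    """One SGD step on cross-entropy plus the Fisher-trace penalty."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss = loss + fisher_coeff * fisher_trace_penalty(model, inputs)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch `create_graph=True` lets the penalty itself be differentiated during `loss.backward()`, so the regularizer actually shapes the update rather than only being measured.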
