Information matrices and generalization

This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) to misspecified models, in estimating the generalization gap, which sheds light on why quantities such as the number of parameters fail to quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices, as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address whether C is a reasonable approximation to F, as is commonly assumed.
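
As a brief sketch of the quantities involved (the notation below is the standard textbook form of these criteria and is assumed here rather than quoted from the paper body), the TIC replaces the AIC parameter-count penalty with a trace that couples gradient noise and curvature. For n samples and a fit \hat\theta with log-likelihood \ell(\hat\theta), and with H and C the Hessian and gradient covariance evaluated at \hat\theta:

    \mathrm{AIC} = -2\,\ell(\hat\theta) + 2k,
    \qquad
    \mathrm{TIC} = -2\,\ell(\hat\theta) + 2\,\operatorname{tr}\!\bigl(C\,H^{-1}\bigr).

When the model is well specified, the information matrix equality gives C = H, so the trace collapses to the parameter count k and TIC reduces to AIC; the regime relevant to deep networks is the misspecified one, where C and H can differ substantially.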
