High-dimensional dynamics of generalization error in neural networks

We analyze the average generalization dynamics of large neural networks trained by gradient descent. We study the practically relevant “high-dimensional” regime, in which the number of free parameters in the network is on the order of, or even larger than, the number of examples in the dataset. Using random matrix theory and exact solutions in linear models, we derive the generalization and training error dynamics of learning and analyze how they depend on the dimensionality of the data and the signal-to-noise ratio of the learning problem. We find that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks. Overtraining is worst at intermediate network sizes, when the effective number of free parameters equals the number of samples, and it can therefore be reduced by making a network smaller or larger. Additionally, in the high-dimensional regime, low generalization error requires starting with small initial weights. We then turn to non-linear neural networks and show that making networks very large does not harm their generalization performance. On the contrary, it can reduce overtraining, even without early stopping or regularization of any sort. We identify two novel phenomena underlying this behavior in overcomplete models: first, there is a frozen subspace of the weights in which no learning occurs under gradient descent; and second, the statistical properties of the high-dimensional regime yield better-conditioned input correlations, which protect against overtraining. We demonstrate that standard applications of theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks, and we derive an alternative bound that incorporates the frozen subspace and conditioning effects and qualitatively matches the behavior observed in simulation.
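As a rough illustration of the linear-model setting described above, the following NumPy sketch trains a noisy linear student-teacher model with full-batch gradient descent from small initial weights and records training and test error over time. This is a minimal sketch under assumed conventions: the function name, scalings, and parameter values (simulate, snr, lr, and so on) are illustrative choices, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(N, P, snr=2.0, lr=0.05, steps=5000, sigma_w0=1e-3):
    """Full-batch gradient descent on a noisy linear student-teacher task.
    N: number of weights (input dimension), P: number of training examples."""
    w_star = rng.normal(size=N)
    w_star *= snr / np.linalg.norm(w_star)        # teacher weights with ||w*|| = snr
    X = rng.normal(size=(P, N))                   # training inputs
    y = X @ w_star + rng.normal(size=P)           # labels corrupted by unit-variance noise
    w = rng.normal(size=N) * sigma_w0             # small random initial weights
    train_err, test_err = [], []
    for _ in range(steps):
        resid = X @ w - y
        w -= lr * X.T @ resid / P                 # gradient step on the squared-error loss
        train_err.append(np.mean(resid ** 2))
        # Population risk: distance to the teacher plus the irreducible noise floor.
        test_err.append(np.sum((w - w_star) ** 2) + 1.0)
    return np.array(train_err), np.array(test_err)

# When N > P, X.T @ X has N - P zero eigenvalues: weight components in that
# null space (the "frozen subspace") never move under gradient descent.
# Overtraining (final test error minus the best test error along the run) is
# expected to be largest near N == P and milder for much smaller or larger N.
for N in (50, 200, 800):                          # P fixed at 200
    tr, te = simulate(N=N, P=200)
    print(f"N={N:4d}  final train={tr[-1]:.3f}  "
          f"best test={te.min():.3f}  final test={te[-1]:.3f}")
```

Comparing the best test error along the trajectory with the final test error gives a simple empirical measure of how much early stopping would have helped at each network size.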
