Multi-scale Feature Learning Dynamics: Insights for Double Descent

A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, which arise from high-dimensional interactions among the large number of network parameters. These non-trivial dynamics give rise to intriguing behaviors such as the phenomenon of “double descent” of the generalization error. The more commonly studied aspect of this phenomenon is model-wise double descent, where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent, in which the test error undergoes two non-monotonic transitions, or descents, as training time increases. Leveraging tools from statistical physics, we study a linear teacher-student setup that exhibits epoch-wise double descent similar to that observed in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of the generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent of the test error. We validate our findings through numerical experiments in which our theory accurately predicts the empirical behavior and remains consistent with observations in deep neural networks.
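To make the mechanism concrete, here is a minimal numerical sketch (not the paper's exact model): a linear student trained from zero initialization with full-batch gradient descent on a teacher-student regression task whose input features come in two groups of very different scale. In the singular-value basis of the training data, each mode is learned at a rate set by its singular value, so large-scale (fast) modes fit both signal and label noise early, while small-scale (slow) modes are fitted only much later. The dimensions, feature scales, noise level, and learning rate below are illustrative assumptions; whether and how strongly a second descent appears depends on these choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: overparameterized regime (n < d), two feature scales.
n, d = 40, 60
noise_std = 0.5
scales = np.where(np.arange(d) < d // 2, 1.0, 0.05)  # fast vs. slow features

w_star = rng.standard_normal(d) / np.sqrt(d)          # teacher weights
X = rng.standard_normal((n, d)) * scales              # training inputs
y = X @ w_star + noise_std * rng.standard_normal(n)   # noisy training labels
X_test = rng.standard_normal((5000, d)) * scales      # clean test set
y_test = X_test @ w_star

lr = 0.1
U, s, Vt = np.linalg.svd(X, full_matrices=False)
coef = (U.T @ y) / s                                  # per-mode least-squares targets

def w_gd(t):
    """Closed-form iterate after t full-batch GD steps, starting from w = 0."""
    learned = 1.0 - (1.0 - lr * s**2 / n) ** t        # per-mode learning factor
    return Vt.T @ (coef * learned)

# Sanity check: the closed form matches explicit gradient descent
# on the loss (1/(2n)) * ||Xw - y||^2.
w = np.zeros(d)
for _ in range(500):
    w -= lr * X.T @ (X @ w - y) / n
assert np.allclose(w, w_gd(500))

# Test error along training time: fast modes (large singular values) fit
# signal and noise early; slow modes only catch up much later.
for t in [10, 100, 1_000, 10_000, 100_000]:
    test_mse = np.mean((X_test @ w_gd(t) - y_test) ** 2)
    print(f"t = {t:>7d}   test MSE = {test_mse:.4f}")
```

The closed-form iterate w_gd(t) follows from diagonalizing the gradient-descent recursion in the singular-value basis and exactly matches the explicit update, which the assert verifies; sweeping t then traces the test-error curve over training time without running the iterative loop to convergence.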
