Epoch-Wise Double Descent Triggered by Learning a Single Sample

Recently, it has been shown empirically that when neural networks are trained on a single training image to reconstruct their input, fully-connected networks (FCNs) learn to output the training image regardless of the test image (memorization), whereas fully-convolutional networks learn to output the given test image (generalization). We further investigated FCNs by splitting the model into the bias of the output layer (the bias component) and the remaining parameters (the weight component), and empirically found that the choice of optimizer plays an important role in the memorization process. Specifically, FCNs memorize the training image through the weight component in the early stage of training, irrespective of the optimizer. As training proceeds, however, some optimizers force the weight component to forget the training image and the bias component to memorize it instead. By assuming that FCNs converge to a constant function during this shift and measuring the generalization error, we observed epoch-wise double descent, which explains why early stopping contributes to better generalization.
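
As a concrete illustration of the setup described above, the sketch below trains an FCN to reconstruct a single image and tracks the two components separately: the bias of the output layer (the bias component) and the forward pass with that bias removed (the weight component). This is a minimal sketch, not the authors' code; the architecture, image size, optimizer, learning rate, and names such as train_image and weight_component are our own assumptions for illustration.

```python
# Minimal sketch (assumed setup): reconstruct a single training image with a plain FCN
# and track how much each component carries of the training image over epochs.
import torch
import torch.nn as nn

D = 32 * 32 * 3                      # flattened image size (assumed)
train_image = torch.rand(1, D)       # the single training image (placeholder data)
test_image = torch.rand(1, D)        # an unseen test image (placeholder data)

class FCN(nn.Module):
    def __init__(self, width=512):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(D, width), nn.ReLU(),
                                    nn.Linear(width, width), nn.ReLU())
        self.out = nn.Linear(width, D)   # its bias is the "bias component"

    def forward(self, x):
        return self.out(self.hidden(x))

    def weight_component(self, x):
        # Output with the final bias subtracted: W * h(x)
        return self.out(self.hidden(x)) - self.out.bias

model = FCN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer choice is an assumption
loss_fn = nn.MSELoss()

for epoch in range(10_000):
    opt.zero_grad()
    loss = loss_fn(model(train_image), train_image)  # reconstruct the single image
    loss.backward()
    opt.step()

    if epoch % 1000 == 0:
        with torch.no_grad():
            # Distance of each component (evaluated on the test input) to the training image.
            w_mem = loss_fn(model.weight_component(test_image), train_image).item()
            b_mem = loss_fn(model.out.bias.unsqueeze(0), train_image).item()
            # Generalization error: how far the full output is from the test image.
            gen_err = loss_fn(model(test_image), test_image).item()
        print(f"epoch {epoch}: weight-comp err {w_mem:.4f}, "
              f"bias-comp err {b_mem:.4f}, test err {gen_err:.4f}")
```

Logging these three quantities over epochs is one way to probe the qualitative picture described in the abstract; the specific numbers depend on the assumed architecture, optimizer, and data, which are placeholders here.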
