Benign Overfitting in Two-layer Convolutional Neural Networks

Modern neural networks often have great expressive power and can be trained to overfit the training data while still achieving good test performance. This phenomenon is referred to as “benign overfitting”. Recently, a line of work has emerged that studies benign overfitting from a theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding of when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful, and the obtained CNN can only achieve a constant-level test loss. Together, these results demonstrate a sharp phase transition between benign and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.
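Although the abstract only summarizes the results, the setting it describes, a signal-plus-noise data model and a two-layer CNN trained by gradient descent, can be sketched concretely. The NumPy snippet below is a minimal illustration of that setting, not the paper's exact construction: the two-patch data layout (one patch carrying the label signal y·mu, the other pure Gaussian noise), the constants `d`, `n`, `m`, `snr`, `noise_std`, `lr`, `epochs`, the plain-ReLU activation, and the fixed random ±1/m second layer are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (not values from the paper).
d, n, m = 100, 50, 20          # patch dimension, training set size, number of filters
snr, noise_std = 2.0, 1.0      # signal strength vs. noise level -- the key knob
lr, epochs = 0.1, 300

mu = np.zeros(d)
mu[0] = snr                    # signal vector with norm equal to `snr`

def make_data(n_samples):
    """Two-patch data: one patch carries the label signal y * mu, the other is pure noise."""
    y = rng.choice([-1.0, 1.0], size=n_samples)
    signal = y[:, None] * mu
    noise = noise_std * rng.standard_normal((n_samples, d))
    return np.stack([signal, noise], axis=1), y       # X has shape (n_samples, 2, d)

X_train, y_train = make_data(n)
X_test, y_test = make_data(1000)

W = 0.01 * rng.standard_normal((m, d))    # trainable filters, shared across both patches
a = rng.choice([-1.0, 1.0], size=m) / m   # fixed (untrained) second-layer weights

def forward(W, X):
    pre = np.einsum('ipd,md->ipm', X, W)  # pre-activations for every (sample, patch, filter)
    hid = np.maximum(pre, 0.0)            # ReLU
    return hid.sum(axis=1) @ a, pre       # sum over patches, then fixed linear output layer

for _ in range(epochs):
    f, pre = forward(W, X_train)
    # logistic loss l(z) = log(1 + exp(-z)) with margin z = y * f(x)
    g = -y_train / (1.0 + np.exp(np.clip(y_train * f, -50, 50)))   # dl/df per sample
    mask = (pre > 0).astype(float)                                  # ReLU derivative
    # chain rule: dL/dW[j] = (1/n) * sum_{i,p} g_i * a_j * mask[i,p,j] * X[i,p]
    grad = np.einsum('i,j,ipj,ipd->jd', g, a, mask, X_train) / n
    W -= lr * grad

for name, (Xs, ys) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
    err = np.mean(np.sign(forward(W, Xs)[0]) != ys)
    print(f"{name} error: {err:.3f}")
```

Varying `snr` relative to `noise_std` and `n` in this sketch is one way to probe the dichotomy the abstract describes: with a strong signal one would expect both training and test error to become small, whereas with a weak signal the network can still fit the training set by latching onto the noise patches while the test error stays far from zero.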
