Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well overparametrized neural networks can recover the true target function in the presence of random noise. We establish a lower bound on the $L_2$ estimation error with respect to the GD iterations, which is bounded away from zero without a delicate choice of early stopping. In turn, through a comprehensive analysis of $\ell_2$-regularized GD trajectories, we prove that for an overparametrized one-hidden-layer ReLU neural network with $\ell_2$ regularization: (1) the output is close to that of kernel ridge regression with the corresponding neural tangent kernel; (2) the minimax optimal rate of the $L_2$ estimation error is achieved. Numerical experiments confirm our theory and further demonstrate that the $\ell_2$ regularization approach improves training robustness and works for a wider range of neural networks.
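To make claim (1) concrete, the sketch below trains a width-$m$ one-hidden-layer ReLU network with $\ell_2$-regularized full-batch GD and compares its fit against kernel ridge regression with the corresponding first-layer NTK. This is a minimal illustration under stated assumptions, not the paper's exact construction: the penalty $\tfrac{\lambda}{2}\|W - W(0)\|_F^2$, the antisymmetric initialization (zero network output at initialization), the closed-form kernel $k(x, x') = x^\top x'\,(\pi - \arccos(x^\top x'))/(2\pi)$ for unit-norm inputs with only the first layer trained, and all constants are illustrative choices.

```python
# Minimal sketch (assumptions, not the paper's exact setup): l2-regularized
# full-batch GD on a one-hidden-layer ReLU network vs. kernel ridge regression
# with the corresponding first-layer neural tangent kernel.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 5, 4096            # samples, input dimension, hidden width (m even)
lam, lr, steps = 1e-2, 0.5, 2000  # weight penalty, step size, GD iterations

# Unit-sphere inputs, noisy observations of a smooth target.
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sin(X @ np.ones(d)) + 0.3 * rng.normal(size=n)

# Antisymmetric initialization: paired neurons with opposite output signs give
# zero network output at initialization; only the first layer W is trained.
V = rng.normal(size=(m // 2, d))
W0 = np.vstack([V, V])
a = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])
W = W0.copy()

def predict(W, X):
    """f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

for _ in range(steps):
    pre = X @ W.T                     # (n, m) pre-activations
    resid = predict(W, X) - y         # (n,) residuals
    # Gradient of (1/2n)||f - y||^2 + (lam/2)||W - W0||_F^2 with respect to W.
    grad = (resid[:, None] * (pre > 0) * a).T @ X / (n * np.sqrt(m)) + lam * (W - W0)
    W -= lr * grad

# Kernel ridge regression with the first-layer NTK on the unit sphere:
# k(x, x') = x.x' * (pi - arccos(x.x')) / (2*pi); ridge level n*lam matches
# the weight penalty under the linearized (NTK) approximation.
G = np.clip(X @ X.T, -1.0, 1.0)
K = G * (np.pi - np.arccos(G)) / (2 * np.pi)
f_krr = K @ np.linalg.solve(K + n * lam * np.eye(n), y)

print("max |NN - KRR| on training inputs:", np.max(np.abs(predict(W, X) - f_krr)))
```

In this sketch the reported gap between the regularized network and the kernel ridge regression fit shrinks as the width $m$ grows, which is the qualitative content of claim (1); the rate in claim (2) concerns the resulting $L_2$ estimation error and is not reproduced here.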
