Weighted Optimization: better generalization by smoother interpolation

We provide a rigorous analysis of how an implicit bias towards smooth interpolants leads to low generalization error in the overparameterized setting. We present the first case study of this connection through a random Fourier series model and weighted least squares. We then argue, through this model and numerical experiments, that normalization methods in deep learning, such as weight normalization, improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants.
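
As a rough illustration of the mechanism described above, the sketch below fits two minimum-norm interpolants to the same overparameterized random Fourier feature problem: one with uniform weights (plain minimum-norm least squares) and one with weights that decay with frequency, so that high-frequency coefficients are penalized more heavily and the interpolant is biased towards smooth solutions. The specific feature map, target function, and 1/k weighting scheme are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized setup: n training points, p >> n random Fourier features.
n, p = 20, 200
x_train = rng.uniform(0.0, 2.0 * np.pi, n)
y_train = np.sin(x_train)                      # hypothetical smooth target function

freqs = rng.integers(1, 50, p)                 # random integer frequencies
phases = rng.uniform(0.0, 2.0 * np.pi, p)      # random phases

def features(x):
    # Random Fourier features: cos(k * x + phase), one column per (frequency, phase) pair.
    return np.cos(np.outer(x, freqs) + phases)

def weighted_min_norm(Phi, y, w):
    # Weighted minimum-norm interpolation:
    #   minimize ||c / w||_2  subject to  Phi @ c = y.
    # Substituting c = w * b turns this into ordinary min-norm least squares in b.
    A = Phi * w                                # rescale columns by the weights
    b = A.T @ np.linalg.pinv(A @ A.T) @ y      # min-norm solution of A @ b = y
    return w * b

Phi_train = features(x_train)

# Uniform weights: plain minimum-norm interpolation, energy spread across all frequencies.
c_uniform = weighted_min_norm(Phi_train, y_train, np.ones(p))

# Decaying weights: high-frequency coefficients are penalized more, biasing the
# interpolant towards smooth (low-frequency) solutions.
c_smooth = weighted_min_norm(Phi_train, y_train, 1.0 / freqs)

# Both interpolants fit the training data exactly; compare them on fresh points.
x_test = np.linspace(0.0, 2.0 * np.pi, 500)
y_test = np.sin(x_test)
Phi_test = features(x_test)
for name, c in [("uniform", c_uniform), ("decaying", c_smooth)]:
    mse = np.mean((Phi_test @ c - y_test) ** 2)
    print(f"{name:>8} weights: test MSE = {mse:.4f}")
```

The reparameterization c = w * b in the sketch is only loosely analogous to weight normalization: rescaling the features and solving a plain minimum-norm problem changes which interpolant is selected, in the same spirit as the implicit bias the abstract attributes to normalization methods.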
