Implicit Regularization and Convergence for Weight Normalization

Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression. WN and rPGD reparametrize the weights as a scale g times a unit vector w, so the objective function becomes non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective: these methods adaptively regularize the weights and converge close to the minimum l2 norm solution, even for initializations far from zero. Specifically, for certain stepsizes for g and w, we show that they converge close to the minimum norm solution. This behavior differs from that of gradient descent, which converges to the minimum norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization.
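
To make the scale/direction reparametrization concrete, below is a minimal sketch (our own illustration, not code from the paper) that compares plain gradient descent with a weight-normalized / rPGD-style update on a synthetic overparametrized least-squares problem. The problem sizes, step sizes, and iteration counts are illustrative assumptions.

```python
# Minimal sketch of weight normalization (WN) / rPGD for overparametrized
# least squares, compared with plain gradient descent. All sizes and step
# sizes are illustrative assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                          # overparametrized: d > n
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)          # noiseless targets; many interpolants exist

def grad(theta):
    """Gradient of 0.5 * ||X @ theta - y||^2."""
    return X.T @ (X @ theta - y)

# Plain gradient descent on theta, started far from zero.
theta = rng.standard_normal(d)
for _ in range(30000):
    theta -= 1e-3 * grad(theta)

# WN / rPGD: theta = g * w with ||w|| = 1, separate step sizes for g and w.
g = 1.0
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
eta_g, eta_w = 1e-3, 1e-4
for _ in range(30000):
    r_grad = grad(g * w)
    g -= eta_g * (w @ r_grad)           # chain rule: dL/dg = w . grad
    w -= eta_w * g * r_grad             # chain rule: dL/dw = g * grad
    w /= np.linalg.norm(w)              # rPGD step: project w back to the unit sphere

# Minimum l2-norm interpolant, for reference.
theta_mn = X.T @ np.linalg.solve(X @ X.T, y)
print("plain GD, distance to min-norm solution:", np.linalg.norm(theta - theta_mn))
print("WN/rPGD,  distance to min-norm solution:", np.linalg.norm(g * w - theta_mn))
```

The contrast the abstract describes can be seen in this setup: the gradient always lies in the row space of X, so plain gradient descent never changes the component of its initialization outside that space, whereas in the reparametrized updates the renormalization of w is what allows that component to shrink.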

[1] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.

[2] Nathan Srebro, et al. Dropout: Explicit Forms and Capacity Control, 2020, ICML.

[3] Michael W. Mahoney, et al. Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning, 2018, J. Mach. Learn. Res.

[4] O. Papaspiliopoulos. High-Dimensional Probability: An Introduction with Applications in Data Science, 2020.

[5] Guido Montúfar, et al. Optimization Theory for ReLU Neural Networks Trained with Normalization Layers, 2020, ICML.

[6] Sifan Liu, et al. Ridge Regression: Structure, Cross-Validation, and Sketching, 2019, ICLR.

[7] Philip M. Long, et al. Benign overfitting in linear regression, 2019, Proceedings of the National Academy of Sciences.

[8] Mikhail Belkin, et al. Two models of double descent for weak features, 2019, SIAM J. Math. Data Sci.

[9] Zhiyuan Zhang, et al. Understanding and Improving Layer Normalization, 2019, NeurIPS.

[10] Yuandong Tian. Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network, 2019, arXiv.

[11] Varun Kanade, et al. Implicit Regularization for Optimal Sparse Recovery, 2019, NeurIPS.

[12] Tomaso A. Poggio, et al. Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization, 2019, arXiv.

[13] Edgar Dobriban, et al. Invariance reduces Variance: Understanding Data Augmentation in Deep Learning and Beyond, 2019, arXiv.

[14] Matus Telgarsky, et al. The implicit bias of gradient descent on nonseparable data, 2019, COLT.

[15] Sanjeev Arora, et al. Implicit Regularization in Deep Matrix Factorization, 2019, NeurIPS.

[16] Yuandong Tian, et al. Luck Matters: Understanding Training Dynamics of Deep ReLU Networks, 2019, arXiv.

[17] Xiaoxia Wu, et al. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization, 2018, ICML.

[18] Xiangru Lian, et al. Revisit Batch Normalization: New Understanding and Refinement via Composition Optimization, 2019, AISTATS.

[19] Jaehoon Lee, et al. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, NeurIPS.

[20] Michael W. Mahoney, et al. Traditional and Heavy-Tailed Self Regularization in Neural Network Models, 2019, ICML.

[21] J. Zico Kolter, et al. A Continuous-Time View of Early Stopping for Least Squares Regression, 2018, AISTATS.

[22] Barnabás Póczos, et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[23] Zuowei Shen, et al. A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent, 2018, ICML.

[24] Sanjeev Arora, et al. Theoretical Analysis of Auto Rate-Tuning by Batch Normalization, 2018, ICLR.

[25] Ping Luo, et al. Towards Understanding Regularization in Batch Normalization, 2018, ICLR.

[26] Yann LeCun, et al. Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks, 2018, arXiv.

[27] Thomas Hofmann, et al. Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization, 2018, AISTATS.

[28] Roman Vershynin. High-Dimensional Probability, 2018.

[29] Tengyuan Liang, et al. Just Interpolate: Kernel "Ridgeless" Regression Can Generalize, 2018, The Annals of Statistics.

[30] Raman Arora, et al. On the Implicit Bias of Dropout, 2018, ICML.

[31] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks, 2018, NeurIPS.

[32] Aleksander Madry, et al. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift), 2018, NeurIPS.

[33] Xiaoxia Wu, et al. WNGrad: Learn the Learning Rate in Gradient Descent, 2018, arXiv.

[34] Elad Hoffer, et al. Norm matters: efficient and accurate normalization schemes in deep networks, 2018, NeurIPS.

[35] Nathan Srebro, et al. Characterizing Implicit Bias in Terms of Optimization Geometry, 2018, ICML.

[36] Hongyang Zhang, et al. Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations, 2017, COLT.

[37] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[38] Nathan Srebro, et al. Implicit Regularization in Matrix Factorization, 2017, Information Theory and Applications Workshop (ITA).

[39] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.

[40] Yi Zheng, et al. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis, 2017, ICML.

[41] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[42] Andrea Vedaldi, et al. Instance Normalization: The Missing Ingredient for Fast Stylization, 2016, arXiv.

[43] Tengyu Ma, et al. Matrix Completion has No Spurious Local Minimum, 2016, NIPS.

[44] Tim Salimans, et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.

[45] Michael I. Jordan, et al. Gradient Descent Converges to Minimizers, 2016, arXiv.

[46] Zoubin Ghahramani, et al. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, 2015, ICML.

[47] Stefan Wager. High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification, 2015, arXiv:1507.03003.

[48] Geoffrey E. Hinton, et al. Deep Learning, 2015, Nature.

[49] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[50] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[51] Ryota Tomioka, et al. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, 2014, ICLR.

[52] Sida I. Wang, et al. Dropout Training as Adaptive Regularization, 2013, NIPS.

[53] Andrea Montanari, et al. The phase transition of matrix recovery from Gaussian measurements matches the minimax MSE of matrix denoising, 2013, Proceedings of the National Academy of Sciences.

[54] Michael W. Mahoney. Approximate computation and implicit regularization for very large-scale data analysis, 2012, PODS.

[55] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[56] Emmanuel J. Candès, et al. Exact Matrix Completion via Convex Optimization, 2008, Found. Comput. Math.

[57] R. Vershynin. A Randomized Kaczmarz Algorithm with Exponential Convergence, 2007, arXiv:math/0702226.

[58] Bernard Widrow, et al. Least-mean-square adaptive filters, 2003.

[59] Sun-Yuan Kung, et al. On gradient adaptation with unit-norm constraints, 2000, IEEE Trans. Signal Process.

[60] Steve Rogers. Adaptive Filter Theory, 1996.

[61] Monson H. Hayes. Statistical Digital Signal Processing and Modeling, 1996.

[62] U. Helmke, et al. Optimization and Dynamical Systems, 1994, Proceedings of the IEEE.

[63] John G. Proakis, et al. Digital Signal Processing: Principles, Algorithms, and Applications, 1992.

[64] Hervé Bourlard, et al. Generalization and Parameter Estimation in Feedforward Nets: Some Experiments, 1989, NIPS.

[65] O. Strand. Theory and methods related to the singular-function expansion and Landweber's iteration for integral equations of the first kind, 1974.