Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

In this paper, we study the implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study gradient descent and gradient flow (i.e., gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that once the training loss decreases below a certain threshold, we can define a smoothed version of the normalized margin that increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of this problem. Our results generalize previous results for logistic regression with one-layer or multi-layer linear networks, and provide more quantitative convergence results, under weaker assumptions, than previous results for smooth homogeneous neural networks. We conduct several experiments on the MNIST and CIFAR-10 datasets to validate our theoretical findings. Finally, as the margin is closely related to robustness, we discuss the potential benefits of training longer for improving the robustness of the model.
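To make the quantities mentioned above concrete, the following is a minimal sketch of the binary-classification formulation; the symbols Phi, theta, q_i, L, and the sample index i are notational assumptions introduced here for illustration, and the precise multi-class and smoothed-margin definitions appear in the body of the paper. For an L-homogeneous model Phi(theta; x), i.e., Phi(c*theta; x) = c^L * Phi(theta; x) for all c > 0, with labels y_i in {-1, +1}, write q_i(theta) = y_i * Phi(theta; x_i). The normalized margin is

\[
\bar{\gamma}(\theta) \;=\; \frac{\min_{i} q_i(\theta)}{\|\theta\|_2^{L}},
\]

and the constrained optimization problem related to margin maximization can be stated, in one standard form, as

\[
\min_{\theta} \;\; \tfrac{1}{2}\|\theta\|_2^{2}
\quad \text{subject to} \quad q_i(\theta) \ge 1 \;\; \text{for all } i.
\]

Under this formulation, the claim of the abstract is that once the training loss is small enough, the (smoothed) normalized margin of the gradient-descent or gradient-flow trajectory is non-decreasing, and it converges to the objective value at a KKT point of the problem above.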
