On Margin Maximization in Linear and ReLU Networks

The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges in direction to a KKT point of the max-margin problem in parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max-margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max-margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed.
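
For concreteness, the parameter-space max-margin problem referred to above can be written as follows. This is a minimal sketch in our own notation, following the standard formulation in Lyu and Li [2019] rather than reproducing the paper's exact statement; the network is denoted $\Phi(\theta;\cdot)$ and assumed homogeneous in the parameters $\theta$, and $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{\pm 1\}$ is the training set.

\begin{align*}
  \min_{\theta}\ \ & \tfrac{1}{2}\|\theta\|^2 \\
  \text{s.t.}\ \ & y_i\,\Phi(\theta; x_i) \ge 1 \quad \text{for all } i \in \{1,\dots,n\}.
\end{align*}

A feasible $\theta$ is a KKT point of this problem if there exist multipliers $\lambda_1,\dots,\lambda_n \ge 0$ such that
\begin{align*}
  \theta &= \sum_{i=1}^{n} \lambda_i\, y_i\, \nabla_\theta \Phi(\theta; x_i) && \text{(stationarity)}, \\
  \lambda_i\bigl(y_i\,\Phi(\theta; x_i) - 1\bigr) &= 0 \quad \text{for all } i && \text{(complementary slackness)},
\end{align*}
with gradients replaced by Clarke subdifferentials when $\Phi$ is non-differentiable, as with ReLU activations. Since the problem is non-convex in $\theta$, a KKT point need not be an optimum; the paper studies when it is, or is not, a local or global optimum.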

[1] Gal Vardi. On the Implicit Bias in Deep-Learning Algorithms, 2022, Commun. ACM.

[2] O. Shamir, et al. Reconstructing Training Data from Trained Neural Networks, 2022, NeurIPS.

[3] Jason D. Lee, et al. On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias, 2022, NeurIPS.

[4] O. Shamir, et al. Gradient Methods Provably Converge to Non-Robust Networks, 2022, NeurIPS.

[5] O. Shamir, et al. Implicit Regularization Towards Rank Minimization in ReLU Networks, 2022, ALT.

[6] Sanjeev Arora, et al. Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias, 2021, NeurIPS.

[7] Ilya P. Razenshteyn, et al. Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm, 2021, COLT.

[8] Nathan Srebro, et al. On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent, 2021, ICML.

[9] Amir Globerson, et al. Towards Understanding Learning in Neural Networks with Linear Teachers, 2021, ICML.

[10] Kaifeng Lyu, et al. Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning, 2020, ICLR.

[11] O. Shamir, et al. Implicit Regularization in ReLU Networks with the Square Loss, 2020, COLT.

[12] H. Mobahi, et al. A Unifying View on Implicit Bias in Training Linear Neural Networks, 2020, ICLR.

[13] Armin Eftekhari, et al. Implicit Regularization in Matrix Sensing: A Geometric View Leads to Stronger Results, 2020, arXiv.

[14] Nathan Srebro, et al. Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy, 2020, NeurIPS.

[15] Ohad Shamir, et al. Gradient Methods Never Overfit On Separable Data, 2020, J. Mach. Learn. Res.

[16] Matus Telgarsky, et al. Gradient descent follows the regularization path for general losses, 2020, COLT.

[17] Matus Telgarsky, et al. Directional convergence and alignment in deep learning, 2020, NeurIPS.

[18] Nadav Cohen, et al. Implicit Regularization in Deep Learning May Not Be Explainable by Norms, 2020, NeurIPS.

[19] Mert Pilanci, et al. Convex Geometry and Duality of Over-parameterized Neural Networks, 2020, J. Mach. Learn. Res.

[20] Mert Pilanci, et al. Revealing the Structure of Deep Neural Networks via Convex Duality, 2020, ICML.

[21] Francis Bach, et al. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, 2020, COLT.

[22] Mohamed Ali Belabbas, et al. On implicit regularization: Morse functions and applications to matrix factorization, 2020, arXiv.

[23] Nathan Srebro, et al. Kernel and Rich Regimes in Overparametrized Models, 2019, COLT.

[24] Kaifeng Lyu, et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[25] Matus Telgarsky, et al. Characterizing the implicit bias via a primal-dual analysis, 2019, ALT.

[26] Sanjeev Arora, et al. Implicit Regularization in Deep Matrix Factorization, 2019, NeurIPS.

[27] Nathan Srebro, et al. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models, 2019, ICML.

[28] Francis Bach, et al. Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks, 2019, NeurIPS.

[29] Matus Telgarsky, et al. Gradient descent aligns the layers of deep linear networks, 2018, ICLR.

[30] Wei Hu, et al. Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced, 2018, NeurIPS.

[31] Nathan Srebro, et al. Implicit Bias of Gradient Descent on Linear Convolutional Networks, 2018, NeurIPS.

[32] Matus Telgarsky, et al. Risk and parameter convergence of logistic regression, 2018, arXiv.

[33] Nathan Srebro, et al. Convergence of Gradient Descent on Separable Data, 2018, AISTATS.

[34] Nathan Srebro, et al. Characterizing Implicit Bias in Terms of Optimization Geometry, 2018, ICML.

[35] Hongyang Zhang, et al. Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations, 2017, COLT.

[36] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[37] Nathan Srebro, et al. Exploring Generalization in Deep Learning, 2017, NIPS.

[38] Nathan Srebro, et al. Implicit Regularization in Matrix Factorization, 2017, 2018 Information Theory and Applications Workshop (ITA).

[39] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[40] Ryota Tomioka, et al. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, 2014, ICLR.

[41] Kalyanmoy Deb, et al. Approximate KKT points and a proximity measure for termination, 2013, J. Glob. Optim.

[42] Mary Phuong, et al. The inductive bias of ReLU networks on orthogonally separable data, 2021, ICLR.

[43] Yuxin Chen, et al. Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval and Matrix Completion, 2018, ICML.

[44] Ying Xiong. Nonlinear Optimization, 2014.

[45] Yu. S. Ledyaev, et al. Nonsmooth analysis and control theory, 1998.