Deep Classifiers trained with the Square Loss

Here we consider a model of the dynamics of gradient flow under the square loss in overparametrized ReLU networks. We show that convergence to a solution with the absolute minimum ρ, defined as the product of the Frobenius norms of the layer weight matrices, is expected when normalization by a Lagrange multiplier (LM) is used together with Weight Decay (WD). We prove that SGD converges to solutions that have a bias towards 1) large margin (i.e., small ρ) and 2) low rank of the weight matrices. In addition, we predict the occurrence of Neural Collapse without ad hoc assumptions such as the unconstrained features hypothesis.

Abstract

Recent results of [1] suggest that the square loss performs on par with the cross-entropy loss in classification tasks with deep networks. While the theoretical understanding of training deep networks with the cross-entropy loss has been growing ([2] and [3]) in terms of margin maximization, the study of the square loss for classification has lagged behind. Here we consider a specific model of the dynamics of, first, gradient flow (GF) and then SGD in overparametrized ReLU networks under the square loss. Under the assumption of convergence to zero-loss minima, we show that solutions have a bias toward small ρ, defined as the product of the Frobenius norms of the unnormalized weight matrix of each layer. We assume that during training each layer weight matrix except the last is normalized using a Lagrange multiplier (LM), together with Weight Decay (WD). For λ → 0 the solution would be the interpolating solution with minimum ρ. In the absence of LM+WD, good solutions for classification may still be achieved because of the implicit bias towards small-norm solutions in the GD dynamics introduced by carefully chosen close-to-zero initial conditions on the norms of the weights, similar to the case of overparametrized linear networks (see Appendix E). However, for λ = 0 we often observe solutions with large ρ that are suboptimal and probably in the NTK regime. We show that convergence to an ideal equilibrium of SGD (V̇_k = 0 for all minibatches) with λ > 0 would imply rank-one weight matrices. This is impossible generically, implying that SGD never converges to the same set of V_k across all minibatches. We claim that this is the origin of Neural Collapse.
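
For concreteness, the quantities above can be written out as follows; the symbols W_k (the weight matrix of layer k) and L (the number of layers) are notation assumed here for illustration rather than quoted from the paper:

```latex
% Notation assumed for illustration: W_k is the layer-k weight matrix, L the depth.
\rho \;=\; \prod_{k=1}^{L} \|W_k\|_F ,
\qquad
V_k \;=\; \frac{W_k}{\|W_k\|_F} \quad \text{(normalized weight matrices)} .

% The "ideal equilibrium of SGD" referred to above is the condition
\dot{V}_k \;=\; 0 \quad \text{for every layer } k \text{ and every minibatch} .
```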

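The following is a minimal, self-contained sketch (not the paper's experimental setup) of how the bias toward small ρ and low-rank weight matrices can be probed numerically. It assumes PyTorch; the toy data, the architecture, and all hyperparameters are illustrative choices, full-batch gradient descent stands in for gradient flow, and the Lagrange-multiplier normalization is not reproduced, only plain weight decay (the λ > 0 case):

```python
# Minimal sketch: train a small overparametrized ReLU network with the square loss
# and weight decay, then inspect rho (the product of the Frobenius norms of the
# layer weight matrices) and the singular-value spectra of the layers.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy binary classification data with +/-1 labels (illustrative, not from the paper).
n, d = 200, 20
X = torch.randn(n, d)
y = ((X[:, 0] > 0).float() * 2 - 1).unsqueeze(1)

# Overparametrized ReLU network without biases.
model = nn.Sequential(
    nn.Linear(d, 256, bias=False),
    nn.ReLU(),
    nn.Linear(256, 256, bias=False),
    nn.ReLU(),
    nn.Linear(256, 1, bias=False),
)

# Square loss; weight_decay > 0 plays the role of lambda in the abstract.
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Full-batch gradient descent as a crude stand-in for gradient flow.
for step in range(3000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# rho = product of the Frobenius norms of the (unnormalized) weight matrices.
weights = [m.weight.detach() for m in model if isinstance(m, nn.Linear)]
rho = torch.prod(torch.stack([torch.linalg.matrix_norm(W) for W in weights]))
print(f"final loss {loss.item():.4f}, rho {rho.item():.3f}")

# Low-rank bias: how fast do the singular values of each layer decay?
for i, W in enumerate(weights):
    s = torch.linalg.svdvals(W)
    print(f"layer {i}: top-3 singular values {[round(v, 3) for v in s[:3].tolist()]}")
```

Repeating the run with weight_decay=0.0 and comparing the resulting ρ and singular-value decay is the natural way to see the qualitative difference the abstract describes between the λ > 0 and λ = 0 regimes.
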
[1] T. Poggio et al., SGD Noise and Implicit Low-Rank Bias in Deep Neural Networks, 2022, ArXiv.

[2] S. Chatterjee, Convergence of gradient descent for deep neural networks, 2022, ArXiv.

[3] Zhihui Zhu et al., On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features, 2022, ICML.

[4] O. Shamir et al., Implicit Regularization Towards Rank Minimization in ReLU Networks, 2022, ALT.

[5] X. Y. Han et al., Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path, 2021, ICLR.

[6] Dustin G. Mixon et al., Neural collapse with unconstrained features, 2020, Sampling Theory, Signal Processing, and Data Analysis.

[7] Zhihui Zhu et al., A Geometric Analysis of Neural Collapse with Unconstrained Features, 2021, NeurIPS.

[8] Benjamin Recht et al., Interpolating Classifiers Make Few Mistakes, 2021, J. Mach. Learn. Res.

[9] Surya Ganguli et al., Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics, 2020, ICLR.

[10] D. Barrett et al., Implicit Gradient Regularization, 2020, ICLR.

[11] Mikhail Belkin et al., Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks, 2020, ICLR.

[12] Mikhail Belkin et al., Classification vs regression in overparameterized regimes: Does the loss function matter?, 2020, J. Mach. Learn. Res.

[13] Mert Pilanci et al., Revealing the Structure of Deep Neural Networks via Convex Duality, 2020, ICML.

[14] Dacheng Tao et al., Orthogonal Deep Neural Networks, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Yaim Cooper, Global Minima of Overparameterized Neural Networks, 2021, SIAM J. Math. Data Sci.

[16] Hangfeng He et al., Layer-Peeled Model: Toward Understanding Well-Trained Deep Neural Networks, 2021, ArXiv.

[17] Stefan Steinerberger et al., Neural Collapse with Cross-Entropy Loss, 2020, ArXiv.

[18] E. Weinan et al., On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers, 2020, ArXiv.

[19] Grant M. Rotskoff et al., A Dynamical Central Limit Theorem for Shallow Neural Networks, 2020, NeurIPS.

[20] David L. Donoho et al., Prevalence of neural collapse during the terminal phase of deep learning training, 2020, Proceedings of the National Academy of Sciences.

[21] Qianli Liao et al., Theoretical issues in deep networks, 2020, Proceedings of the National Academy of Sciences.

[22] Francis Bach et al., Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, 2020, COLT.

[23] Hossein Mobahi et al., Fantastic Generalization Measures and Where to Find Them, 2019, ICLR.

[24] S. Shalev-Shwartz et al., The Implicit Bias of Depth: How Incremental Learning Drives Generalization, 2019, ICLR.

[25] Kaifeng Lyu et al., Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[26] Tomaso Poggio et al., Generalization in deep network classifiers trained with the square loss, 2020.

[27] Tomaso Poggio et al., Loss landscape: SGD has a better view, 2020.

[28] Tomaso Poggio et al., Loss landscape: SGD can have a better view than GD, 2020.

[29] Marius Kloft et al., Improved Generalisation Bounds for Deep Learning Through L∞ Covering Numbers, 2019, ArXiv.

[30] Nathan Srebro et al., Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models, 2019, ICML.

[31] Ruosong Wang et al., Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks, 2019, ICML.

[32] Daniel Kunin et al., Loss Landscapes of Regularized Linear Autoencoders, 2019, ICML.

[33] Quynh Nguyen et al., On Connected Sublevel Sets in Deep Learning, 2019, ICML.

[34] Francis Bach et al., On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[35] Barnabás Póczos et al., Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[36] Sanjeev Arora et al., Theoretical Analysis of Auto Rate-Tuning by Batch Normalization, 2018, ICLR.

[37] Hossein Mobahi et al., Predicting the Generalization Gap in Deep Networks with Margin Distributions, 2018, ICLR.

[38] Tomaso A. Poggio et al., Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.

[39] Adel Javanmard et al., Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks, 2017, IEEE Transactions on Information Theory.

[40] Arthur Jacot et al., Neural tangent kernel: convergence and generalization in neural networks (invited paper), 2018, NeurIPS.

[41] Yi Zhou et al., When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models?, 2018.

[42] Andrea Montanari et al., A mean field view of the landscape of two-layer neural networks, 2018, Proceedings of the National Academy of Sciences.

[43] Yi Zhang et al., Stronger generalization bounds for deep nets via a compression approach, 2018, ICML.

[44] Ohad Shamir et al., Size-Independent Sample Complexity of Neural Networks, 2017, COLT.

[45] Nathan Srebro et al., The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[46] Matus Telgarsky et al., Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[47] Inderjit S. Dhillon et al., Recovery Guarantees for One-hidden-layer Neural Networks, 2017, ICML.

[48] T. Poggio et al., Deep vs. shallow networks: An approximation theory perspective, 2016, ArXiv.

[49] Lorenzo Rosasco et al., On Invariance and Selectivity in Representation Learning, 2015, ArXiv.

[50] A. Blum, 10-806 Foundations of Machine Learning and Data Science, 2015.

[51] Ambuj Tewari et al., Smoothness, Low Noise and Fast Rates, 2010, NIPS.

[52] Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009.

[53] Gábor Lugosi et al., Introduction to Statistical Learning Theory, 2004, Advanced Lectures on Machine Learning.

[54] T. Poggio et al., Statistical Learning: Stability is Sufficient for Generalization and Necessary and Sufficient for Consistency of Empirical Risk Minimization, 2002.

[55] Tomaso Poggio et al., Everything old is new again: a fresh look at historical approaches in machine learning, 2002.

[56] Peter L. Bartlett et al., The importance of convexity in learning with squared loss, 1998, COLT '96.

[57] P. Welch, The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms, 1967.