Implicit dynamic regularization in deep networks

Tomaso Poggio and Qianli Liao

Abstract

Square loss has been observed to perform well in classification tasks. However, a theoretical justification is lacking, unlike the cross-entropy [1] case, for which an asymptotic analysis has been proposed (see [2] and [3] and references therein). Here we discuss several observations on the dynamics of gradient flow under the square loss in ReLU networks. We show how convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization [4] (BN) or Weight Normalization [5] (WN) are used together with zero initial conditions on the layer weights. This is similar to the behavior of linear degenerate networks under gradient descent (GD), although the reason for zero initial conditions is different.
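As a rough illustration of these dynamics, the sketch below trains a small ReLU network under the square loss with Weight Normalization and a deliberately small initial scale, then reports the Frobenius norm of each layer's effective weight matrix. The architecture, toy data, optimizer settings, and the use of torch.nn.utils.weight_norm are assumptions made for this example, not the paper's experimental setup.

```python
# A minimal sketch (assumed setup, not the paper's experiments): gradient descent
# on the square loss for a small ReLU network with Weight Normalization (WN) and
# near-zero initial effective weights, tracking per-layer Frobenius norms.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy binary problem with +/-1 targets, so the square loss is used for classification.
X = torch.randn(64, 10)
y = torch.sign(X[:, :1] + 0.5 * X[:, 1:2])

linears = [
    nn.utils.weight_norm(nn.Linear(10, 32, bias=False)),
    nn.utils.weight_norm(nn.Linear(32, 1, bias=False)),
]
model = nn.Sequential(linears[0], nn.ReLU(), linears[1])

# "Zero initial conditions": shrink the WN scale g so each effective weight matrix
# starts with a small Frobenius norm (with dim=0, each row norm equals |g_i|).
with torch.no_grad():
    for lin in linears:
        lin.weight_g.mul_(0.1)

opt = torch.optim.SGD(model.parameters(), lr=0.05)
for step in range(10000):
    opt.zero_grad()
    loss = ((model(X) - y) ** 2).mean()
    loss.backward()
    opt.step()

# Under WN with dim=0, W = g * v / ||v|| row by row, so the Frobenius norm of the
# effective weight matrix equals the Euclidean norm of its g vector.
norms = [round(lin.weight_g.norm().item(), 3) for lin in linears]
print(f"final square loss {loss.item():.4f}, per-layer Frobenius norms {norms}")
```

Note that because g multiplies a unit-norm direction, shrinking g (rather than v) is what makes the initial effective weights small in this parameterization.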
The main property of the minimizer that bounds its expected error is its norm: we prove that among all the interpolating solutions, the ones associated with smaller Frobenius norms of the weight matrices have better margin and better bounds on the expected classification error. The theory yields several predictions, including the joint role of BN and weight decay, aspects of Papyan, Han and Donoho’s Neural Collapse, and the constraints induced by BN on the network weights.

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
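The norm-margin claim has a simple one-layer analogue that can be checked directly: among linear predictors interpolating +/-1 labels, any interpolating solution differs from the minimum-norm one by a null-space component, which can only increase the norm and shrink the normalized margin. The sketch below is this sanity check under assumed toy dimensions, not the paper's multilayer argument.

```python
# One-layer sanity check (assumed toy setup): among interpolating linear predictors,
# the minimum-norm one attains the largest normalized margin min_i y_i <w, x_i> / ||w||.
import torch

torch.manual_seed(0)
n, d = 20, 50                        # overparameterized: infinitely many interpolants
X = torch.randn(n, d)
y = torch.sign(torch.randn(n, 1))

pinv = torch.linalg.pinv(X)
w_min = pinv @ y                     # minimum-norm interpolating solution
null_proj = torch.eye(d) - pinv @ X  # projector onto the null space of X
w_alt = w_min + null_proj @ torch.randn(d, 1)  # another interpolating solution

for name, w in [("min-norm", w_min), ("alternative", w_alt)]:
    fit = (X @ w - y).abs().max().item()                   # ~0 for both: they interpolate
    margin = (y * (X @ w)).min().item() / w.norm().item()  # normalized margin
    print(f"{name}: fit error {fit:.1e}, norm {w.norm().item():.3f}, margin {margin:.3f}")
```

Both candidates fit the labels exactly, but the minimum-norm solution attains margin 1/||w||, which any additional null-space component can only reduce.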

References

[1] Tomaso Poggio et al. Complexity control by gradient descent in deep networks, 2020, Nature Communications.

[2] Tim Salimans et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.

[3] Francis Bach et al. A Note on Lazy Training in Supervised Differentiable Programming, 2018, ArXiv.

[4] Daniel Kunin et al. Loss Landscapes of Regularized Linear Autoencoders, 2019, ICML.

[5] Tomaso A. Poggio et al. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.

[6] Qianli Liao et al. Theoretical issues in deep networks, 2020, Proceedings of the National Academy of Sciences.

[7] Amit Daniely et al. The Implicit Bias of Depth: How Incremental Learning Drives Generalization, 2020, ICLR.

[8] Nathan Srebro et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[9] Paulo Jorge S. G. Ferreira et al. The existence and uniqueness of the minimum norm solution to certain linear and nonlinear problems, 1996, Signal Process.

[10] Matus Telgarsky et al. Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[11] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[12] Sanjeev Arora et al. Theoretical Analysis of Auto Rate-Tuning by Batch Normalization, 2018, ICLR.

[13] Lorenzo Rosasco et al. Learning with Incremental Iterative Regularization, 2014, NIPS.

[14] Colin Wei et al. Shape Matters: Understanding the Implicit Bias of the Noise Covariance, 2020, COLT.

[15] Mikhail Belkin et al. Classification vs regression in overparameterized regimes: Does the loss function matter?, 2020, ArXiv.

[16] R. Rockafellar et al. Implicit Functions and Solution Mappings, 2009.

[17] Dacheng Tao et al. Orthogonal Deep Neural Networks, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18] Mikhail Belkin et al. Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks, 2020, ArXiv.

[19] Davide Anguita et al. Tikhonov, Ivanov and Morozov regularization for support vector machine learning, 2015, Machine Learning.

[20] R. Douglas et al. Neuronal circuits of the neocortex, 2004, Annual Review of Neuroscience.

[21] David L. Donoho et al. Prevalence of neural collapse during the terminal phase of deep learning training, 2020, Proceedings of the National Academy of Sciences.

[22] Nathan Srebro et al. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models, 2019, ICML.

[23] Lorenzo Rosasco et al. For interpolating kernel machines, minimizing the norm of the ERM solution minimizes stability, 2020.

[24] Kaifeng Lyu et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[25] Tomaso Poggio et al. Loss landscape: SGD can have a better view than GD, 2020.

[26] Gábor Lugosi et al. Introduction to Statistical Learning Theory, 2004, Advanced Lectures on Machine Learning.

[27] Tomaso Poggio et al. Loss landscape: SGD has a better view, 2020.