Directional convergence and alignment in deep learning

In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge. This proof holds for deep homogeneous networks -- a broad class of networks allowing for ReLU, max pooling, linear, and convolutional layers -- and we additionally provide empirical support not just close to the theory (e.g., the AlexNet), but also on non-homogeneous networks (e.g., the ResNet). If the network further has locally Lipschitz gradients, we show that these gradients converge in direction, and asymptotically align with the gradient flow path, with consequences for margin maximization. Our analysis complements and is distinct from the well-known neural tangent and mean-field theories, and in particular makes no requirements on network width and initialization, instead merely requiring perfect classification accuracy. The proof proceeds by developing a theory of unbounded nonsmooth Kurdyka-Łojasiewicz inequalities for functions definable in an o-minimal structure, and is also applicable outside deep learning.
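
As a rough empirical illustration of the quantities discussed above, the sketch below trains a small bias-free two-layer ReLU network (positively homogeneous of degree L = 2) with SGD on toy separable data, and monitors three quantities: the cosine similarity between successive weight directions theta_t / ||theta_t||, the alignment between the negative gradient and the current weight direction, and the normalized margin min_i y_i f(x_i) / ||theta_t||^L. The architecture, data, optimizer, and step size are illustrative assumptions, not the paper's experimental setup, and SGD with a finite step size merely stands in for the gradient flow analyzed in the theory.

```python
# Minimal sketch (illustrative assumptions, not the paper's experiments):
# monitor directional convergence, gradient alignment, and the normalized
# margin of a bias-free (hence homogeneous) ReLU network during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy linearly separable data with labels in {-1, +1}.
X = torch.randn(64, 2)
y = (X[:, 0] + X[:, 1] > 0).float() * 2 - 1

# Bias-free two-layer ReLU network: positively homogeneous of degree L = 2.
model = nn.Sequential(
    nn.Linear(2, 16, bias=False),
    nn.ReLU(),
    nn.Linear(16, 1, bias=False),
)
L = 2
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def flat_params(m):
    """Concatenate all weights into a single vector theta."""
    return torch.cat([p.detach().reshape(-1) for p in m.parameters()])

def flat_grads(m):
    """Concatenate all weight gradients into a single vector."""
    return torch.cat([p.grad.detach().reshape(-1) for p in m.parameters()])

prev_dir = None
for step in range(20001):
    opt.zero_grad()
    out = model(X).squeeze(1)
    loss = F.softplus(-y * out).mean()   # logistic loss log(1 + exp(-y f(x)))
    loss.backward()

    if step % 5000 == 0:
        theta = flat_params(model)
        grad = flat_grads(model)
        direction = theta / theta.norm()
        # Cosine similarity between successive weight directions:
        # it should approach 1 if the weights converge in direction.
        drift = float("nan") if prev_dir is None else torch.dot(direction, prev_dir).item()
        prev_dir = direction
        # Alignment of the negative gradient with the current weight direction.
        align = torch.dot(-grad / grad.norm(), direction).item()
        # Normalized margin of an L-homogeneous network.
        margin = (y * out.detach()).min().item() / theta.norm().item() ** L
        print(f"step {step:6d}  loss {loss.item():.4f}  dir cos {drift:+.4f}  "
              f"align {align:+.4f}  normalized margin {margin:+.5f}")

    opt.step()
```

Under these illustrative assumptions, once the toy data are perfectly classified the direction cosine and the alignment statistic should both drift toward 1 and the normalized margin should stabilize, mirroring the directional convergence, alignment, and margin statements sketched in the abstract.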
