On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias

We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, then with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network that achieves perfect training accuracy and has at most $\mathcal{O}(r)$ linear regions, which implies a generalization bound. Unlike many other results in the literature, our result holds, under an additional assumption on the data distribution, even for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
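
The objects in the abstract are concrete enough to sketch in a few lines. Below is a minimal NumPy illustration (not taken from the paper, and not its proof construction) of a one-hidden-layer univariate ReLU network $x \mapsto \sum_{i=1}^{n} v_i\,[w_i x + b_i]_+$ together with a grid-based count of its linear regions via the hidden-layer activation patterns; the parameterization, the Gaussian initialization, and the counting heuristic are assumptions made here purely for illustration.

```python
# Minimal sketch of a shallow univariate ReLU network and a count of its
# linear regions on an interval. Names and choices here are illustrative
# assumptions, not the construction used in the paper.
import numpy as np

rng = np.random.default_rng(0)

def shallow_relu_net(w, b, v, x):
    """Evaluate N(x) = sum_i v_i * relu(w_i * x + b_i) for scalar or array x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    pre = np.outer(x, w) + b             # pre-activations, shape (len(x), width)
    return np.maximum(pre, 0.0) @ v      # network outputs, shape (len(x),)

def count_linear_regions(w, b, grid):
    """Count linear regions on a fine grid by counting changes in the
    hidden-layer activation pattern (each pattern gives one affine piece)."""
    pre = np.outer(grid, w) + b
    patterns = (pre > 0).astype(int)
    # consecutive grid points with the same pattern lie in the same region
    changes = np.any(np.diff(patterns, axis=0) != 0, axis=1)
    return 1 + int(changes.sum())

width = 50
w = rng.standard_normal(width)
b = rng.standard_normal(width)
v = rng.standard_normal(width) / width

grid = np.linspace(-5.0, 5.0, 10_001)
print("network value at x = 0.3:", shallow_relu_net(w, b, v, 0.3)[0])
print("linear regions on [-5, 5]:", count_linear_regions(w, b, grid))
```

At random initialization the count is typically close to the width, since almost every hidden unit contributes a breakpoint inside the interval; the paper's claim concerns the network GF converges to in direction, whose effective number of regions is bounded by $\mathcal{O}(r)$ rather than by the width.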
