Implicit bias of gradient descent for mean squared error regression with wide neural networks

We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. Focusing on 1D regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data and whose difference from initialization has the smallest 2-norm of the second derivative weighted by a curvature penalty $1/\zeta$. The curvature penalty is determined by the probability distribution used to initialize the network parameters, and we compute it explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. The statement generalizes to the training trajectories, which in turn are captured by trajectories of spatially adaptive smoothing splines with decreasing regularization strength.
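To make the statement concrete, the following is a minimal, hypothetical sketch (not the authors' code): it trains a wide shallow ReLU network by full-batch gradient descent on a handful of 1D points and compares the learned function with the natural cubic spline interpolant of the same data. The width, learning rate, number of steps, the Gaussian antisymmetric ("mirrored") initialization, and the choice to train only the output layer are illustrative assumptions, so the agreement with the spline is qualitative rather than an exact instance of the constant-curvature-penalty case described above.

```python
# Minimal illustrative sketch: wide shallow ReLU network trained by gradient descent
# on 1D data, compared with the natural cubic spline interpolant of the same data.
# Hyperparameters and the output-layer-only training are assumptions for brevity.
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)

# A few 1D training points
x_train = np.array([-0.8, -0.3, 0.1, 0.5, 0.9])
y_train = np.array([0.2, -0.4, 0.3, 0.1, -0.2])

# Network f(x) = (1/sqrt(n)) * sum_k w_k * relu(a_k * x + b_k),
# with a mirrored initialization so the initial function is identically zero.
n = 4000
a_half = rng.normal(size=n // 2)
b_half = rng.normal(size=n // 2)
w_half = rng.normal(size=n // 2)
a = np.concatenate([a_half, a_half])
b = np.concatenate([b_half, b_half])
w = np.concatenate([w_half, -w_half])   # opposite-sign output weights cancel at init

def features(x):
    """ReLU features with 1/sqrt(n) output scaling; shape (len(x), n)."""
    return np.maximum(np.outer(x, a) + b, 0.0) / np.sqrt(n)

def f(x):
    return features(x) @ w

# Full-batch gradient descent on the mean squared error (output layer only).
# The step size is set from the largest eigenvalue of the empirical kernel
# K = Phi Phi^T to keep the iteration stable.
Phi = features(x_train)
K = Phi @ Phi.T
lr = 1.0 / (np.linalg.eigvalsh(K).max() / len(x_train))
for _ in range(100_000):
    w -= lr * Phi.T @ (Phi @ w - y_train) / len(x_train)

# Compare the trained network with the natural cubic spline interpolant
xs = np.linspace(-1.0, 1.0, 201)
spline = CubicSpline(x_train, y_train, bc_type="natural")
print("max |network - spline| on [-1, 1]:", np.max(np.abs(f(xs) - spline(xs))))
print("max training residual:", np.max(np.abs(f(x_train) - y_train)))
```

Because the initial function is zero under the mirrored initialization, the trained network itself (rather than its difference from initialization) can be compared directly with the spline interpolant of the data.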
