How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer

In this paper, we consider one-dimensional (shallow) ReLU neural networks in which the first-layer weights are chosen randomly and only the terminal layer is trained. First, we show mathematically that for such networks L2-regularized regression corresponds in function space to regularizing the estimate's second derivative, for fairly general loss functionals. Second, we show that for least squares regression the trained network converges to the smooth spline interpolation of the training data as the number of hidden nodes tends to infinity. Moreover, we derive a novel correspondence between early-stopped gradient descent (without any explicit regularization of the weights) and smoothing spline regression.
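
To make the setup concrete, below is a minimal sketch (not the authors' code) of the model the abstract describes: a 1-D ReLU network whose first-layer weights and biases are sampled once and frozen, whose output layer is fitted by L2-regularized least squares, compared against a classical smoothing spline on the same data. It assumes Python with NumPy and SciPy; all names and hyperparameter values (n_hidden, ridge_lambda, the sampling distributions, the spline smoothing level) are illustrative assumptions rather than the paper's choices.

```python
# Sketch of a random-feature ReLU network in 1-D: random frozen first layer,
# only the output weights are trained, with an L2 (ridge) penalty.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Training data: noisy samples of a smooth target function (illustrative).
x_train = np.sort(rng.uniform(-1.0, 1.0, size=30))
y_train = np.sin(3.0 * x_train) + 0.1 * rng.standard_normal(30)

# Random first layer: weights v_k and biases b_k are drawn once and never trained.
n_hidden = 1000
v = rng.standard_normal(n_hidden)
b = rng.uniform(-1.0, 1.0, size=n_hidden)

def relu_features(x):
    """Hidden-layer activations ReLU(v_k * x + b_k) for a batch of inputs x."""
    return np.maximum(np.outer(x, v) + b, 0.0)

# Train only the terminal layer: ridge regression on the random ReLU features.
# The paper's first result says this L2 penalty on the output weights acts,
# in function space, like a penalty on the second derivative of the estimate.
ridge_lambda = 1e-3
Phi = relu_features(x_train)
w = np.linalg.solve(Phi.T @ Phi + ridge_lambda * np.eye(n_hidden),
                    Phi.T @ y_train)

def network(x):
    """Prediction of the trained random-feature ReLU network."""
    return relu_features(x) @ w

# For comparison, a classical smoothing spline (cubic, SciPy). With the
# penalties matched appropriately and n_hidden large, the two curves should
# be close, in the spirit of the limit statements in the abstract.
spline = UnivariateSpline(x_train, y_train, k=3, s=len(x_train) * 0.1**2)

x_test = np.linspace(-1.0, 1.0, 200)
print("max |network - spline| on test grid:",
      np.max(np.abs(network(x_test) - spline(x_test))))
```

The early-stopping correspondence could be illustrated in the same way by replacing the closed-form ridge solution with a few gradient-descent steps on the unregularized least-squares loss over the output weights and stopping early; the sketch above only covers the explicitly L2-regularized case.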
