Theory IIIb: Generalization in Deep Networks

A main puzzle of deep neural networks (DNNs) revolves around the apparent absence of “overfitting”, defined in this paper as follows: the expected error does not get worse when increasing the number of neurons or of iterations of gradient descent. This is surprising because of the large capacity demonstrated by DNNs to fit randomly labeled data and the absence of explicit regularization. Recent results by Srebro et al. provide a satisfying solution of the puzzle for linear networks used in binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exp-loss yields asymptotic, “slow” convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions. Here we prove a similar result for nonlinear multilayer DNNs near zero minima of the empirical loss. The result holds for exponential-type losses but not for the square loss. In particular, we prove that the normalized weight matrix at each layer of a deep network converges to a minimum norm solution (in the separable case). Our analysis of the dynamical system corresponding to gradient descent of a multilayer network suggests a simple criterion for predicting the generalization performance of different zero minimizers of the empirical loss. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. Theory IIIb: Generalization in Deep Networks Tomaso Poggio ∗1, Qianli Liao1, Brando Miranda1, Andrzej Banburski1, Xavier Boix1, and Jack Hidary2 1Center for Brains, Minds and Machines, MIT 2Alphabet (Google) X

[1]  Tomaso A. Poggio,et al.  Fisher-Rao Metric, Geometry, and Complexity of Neural Networks , 2017, AISTATS.

[2]  Nathan Srebro,et al.  Implicit Bias of Gradient Descent on Linear Convolutional Networks , 2018, NeurIPS.

[3]  Mikhail Belkin,et al.  To understand deep learning we need to understand kernel learning , 2018, ICML.

[4]  Tomaso A. Poggio,et al.  Theory of Deep Learning IIb: Optimization Properties of SGD , 2018, ArXiv.

[5]  Lorenzo Rosasco,et al.  Theory of Deep Learning III: explaining the non-overfitting puzzle , 2017, ArXiv.

[6]  Nathan Srebro,et al.  The Implicit Bias of Gradient Descent on Separable Data , 2017, J. Mach. Learn. Res..

[7]  Guillermo Sapiro,et al.  Robust Large Margin Deep Neural Networks , 2017, IEEE Transactions on Signal Processing.

[8]  Nathan Srebro,et al.  Exploring Generalization in Deep Learning , 2017, NIPS.

[9]  Matus Telgarsky,et al.  Spectrally-normalized margin bounds for neural networks , 2017, NIPS.

[10]  Noah Golowich,et al.  Musings on Deep Learning: Properties of SGD , 2017 .

[11]  Tomaso A. Poggio,et al.  Theory II: Landscape of the Empirical Risk in Deep Learning , 2017, ArXiv.

[12]  Samy Bengio,et al.  Understanding deep learning requires rethinking generalization , 2016, ICLR.

[13]  Lorenzo Rosasco,et al.  Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review , 2016, International Journal of Automation and Computing.

[14]  T. Poggio,et al.  Deep vs. shallow networks : An approximation theory perspective , 2016, ArXiv.

[15]  Tim Salimans,et al.  Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks , 2016, NIPS.

[16]  Yoram Singer,et al.  Train faster, generalize better: Stability of stochastic gradient descent , 2015, ICML.

[17]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[18]  Y. Yao,et al.  On Early Stopping in Gradient Descent Learning , 2007 .

[19]  김희라 Waiting for Godot에 나타난 희망의 구조 , 2003 .

[20]  B. Aulbach,et al.  The Hartman-Grobman theorem for Carathéodory-type differential equations in Banach spaces , 2000 .

[21]  Peter L. Bartlett,et al.  Neural Network Learning - Theoretical Foundations , 1999 .

[22]  Aleksej F. Filippov,et al.  Differential Equations with Discontinuous Righthand Sides , 1988, Mathematics and Its Applications.

[23]  G. P. Szegö,et al.  Stability theory of dynamical systems , 1970 .