Generalization Puzzles in Deep Networks

In the last few years, deep learning has been tremendously successful in many applications. However, our theoretical understanding of deep learning, and thus our ability to provide principled improvements, seems to lag behind. One theoretical puzzle concerns the ability of deep networks to predict well despite overparametrization and despite the lack of complexity control in the form of explicit regularization. Do deep networks require a drastically new theory of generalization? In particular, are there measurements based on the training data that are predictive of the network's performance on future data? Here we show that when performance is measured appropriately, training performance is in fact predictive of expected performance, consistent with classical machine learning theory and with the presence of an implicit regularization.
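As a concrete illustration of what "measured appropriately" can mean, the sketch below evaluates the training cross-entropy of a copy of the network whose layer weights have been rescaled to unit Frobenius norm, so that the measurement is not distorted by the growth of the weights during training. This is a minimal, hypothetical sketch in PyTorch: the toy model, the random data, and the choice of per-layer Frobenius normalization are illustrative assumptions, not the authors' exact procedure.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalized_training_loss(model, inputs, targets):
    """Cross-entropy on the training data, evaluated on a copy of the model
    whose Linear layers are rescaled to unit Frobenius norm.

    The per-layer normalization used here is an illustrative assumption; it
    stands in for the "appropriate measurement" discussed in the abstract.
    """
    normalized = copy.deepcopy(model)
    with torch.no_grad():
        for module in normalized.modules():
            if isinstance(module, nn.Linear):
                norm = module.weight.norm()    # Frobenius norm of the weight matrix
                if norm > 0:
                    module.weight.div_(norm)   # rescale weights; biases left untouched
        logits = normalized(inputs)
        return F.cross_entropy(logits, targets).item()

# Toy usage on random data (illustration only): the normalized training loss
# is the quantity one would then compare against held-out test performance.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
x_train, y_train = torch.randn(256, 20), torch.randint(0, 10, (256,))
print("normalized training loss:", normalized_training_loss(model, x_train, y_train))
```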
