Understanding deep learning (still) requires rethinking generalization

Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images with completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice. We interpret our experimental findings by comparison with traditional models. We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.
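
The label-randomization test described above is straightforward to reproduce. Below is a minimal sketch, assuming PyTorch, torchvision, and CIFAR-10 (none of which this abstract specifies); the small convolutional network and hyperparameters are illustrative placeholders, not the authors' exact architectures or training setup.

```python
# Minimal sketch of the randomization test: train a CNN on CIFAR-10 images
# whose labels have been replaced with uniformly random classes.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# Load CIFAR-10 and overwrite every label with a random class (illustrative setup).
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())
train_set.targets = torch.randint(0, 10, (len(train_set),)).tolist()
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# A small convolutional network stands in for the larger architectures in the paper.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# Trained long enough, the network typically drives training accuracy on the
# random labels toward 100% while test accuracy stays at chance -- the
# phenomenon the abstract describes.
for epoch in range(100):
    correct, total = 0, 0
    for x, y in loader:
        opt.zero_grad()
        logits = model(x)
        loss_fn(logits, y).backward()
        opt.step()
        correct += (logits.argmax(1) == y).sum().item()
        total += y.numel()
    print(f"epoch {epoch}: train accuracy on random labels = {correct / total:.3f}")
```

The same harness can be reused for the noise variant mentioned in the abstract by replacing the input images with random pixels rather than (or in addition to) randomizing the labels.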
