Generalization in Deep Learning

This paper provides non-vacuous and numerically tight generalization guarantees for deep learning, as well as theoretical insights into why and how deep learning can generalize well despite its large capacity, complexity, possible algorithmic instability, non-robustness, and sharp minima, responding to an open question in the literature. We also propose new open problems and discuss the limitations of our results.
