CBMM Memo No. 067, April 4, 2017. Theory of Deep Learning III: Generalization Properties of SGD.

In Theory III we characterize, with a mix of theory and experiments, the generalization properties of Stochastic Gradient Descent (SGD) in overparametrized deep convolutional networks. We show that SGD selects with high probability solutions that (1) have zero (or small) empirical error, (2) are degenerate, as shown in Theory II, and (3) have maximum generalization. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. H.M. is supported in part by ARO Grant W911NF-15-1-0385.
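
To make the claim about zero (or small) empirical error concrete, the following minimal Python/NumPy sketch runs plain single-example SGD on a small, overparametrized two-layer ReLU network and reports the resulting training error. It is an illustration only, not the paper's experimental setup; the network width, learning rate, number of passes, and the random regression data are all hypothetical choices.

# Minimal illustration (not the paper's setup): plain SGD on an
# overparametrized two-layer ReLU network, driving the empirical
# squared loss on a small random regression problem toward zero.
# Width, learning rate, number of passes, and data are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 20, 5, 200              # n samples, d inputs, h hidden units (d*h + h >> n)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)        # arbitrary targets; an overparametrized net can still fit them

W1 = rng.standard_normal((d, h)) / np.sqrt(d)   # input-to-hidden weights
w2 = rng.standard_normal(h) / np.sqrt(h)        # hidden-to-output weights
lr = 1e-2

for epoch in range(2000):
    for i in rng.permutation(n):              # one pass of single-example SGD
        x, t = X[i], y[i]
        z = x @ W1                            # pre-activations
        a = np.maximum(z, 0.0)                # ReLU
        err = a @ w2 - t                      # residual of the squared loss 0.5 * err**2
        grad_w2 = err * a                     # gradient w.r.t. output weights
        grad_W1 = err * np.outer(x, (z > 0) * w2)   # gradient w.r.t. hidden weights
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1

train_mse = np.mean((np.maximum(X @ W1, 0.0) @ w2 - y) ** 2)
print(f"empirical (training) MSE after SGD: {train_mse:.2e}")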

[1] S. Mitter, et al. Recursive stochastic algorithms for global optimization in R^d, 1991.

[2] Peter L. Bartlett, et al. The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998, IEEE Trans. Inf. Theory.

[4] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[5] Lorenzo Rosasco, et al. Optimal Learning for Multi-pass Stochastic Gradient Methods, 2016, NIPS.

[6] Lorenzo Rosasco, et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, 2016, International Journal of Automation and Computing.

[7] T. Poggio, et al. On optimal nonlinear associative recall, 1975, Biological Cybernetics.

[8] T. Poggio, et al. Theory of Deep Learning III: Generalization Properties of SGD, CBMM Memo No. 067, June 27, 2017.

[9] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[10] John N. Tsitsiklis, et al. Gradient Convergence in Gradient Methods with Errors, 1999, SIAM J. Optim.

[11] Andreas Maurer, et al. Bounds for Linear Multi-Task Learning, 2006, J. Mach. Learn. Res.

[12] Razvan Pascanu, et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.

[13] W. Hackbusch, et al. On the Convergence of Alternating Least Squares Optimisation in Tensor Format Representations, 2015, arXiv:1506.00062.

[14] Peter L. Bartlett, et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, 2003, J. Mach. Learn. Res.

[15] S. Mitter, et al. Metropolis-type annealing algorithms for global optimization in R^d, 1993.

[16] Tomaso A. Poggio, et al. Theory II: Landscape of the Empirical Risk in Deep Learning, 2017, arXiv.

[17] B. Gidas. Global optimization via the Langevin equation, 1985, 24th IEEE Conference on Decision and Control.

[18] V. Koltchinskii, et al. Empirical margin distributions and bounding the generalization error of combined classifiers, 2002, arXiv:math/0405343.

[19] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[20] P. Lennie. Receptive fields, 2003, Current Biology.

[21] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[22] Sanguthevar Rajasekaran, et al. On the Convergence Time of Simulated Annealing, 1990.

[23] Joshua B. Tenenbaum, et al. Human-level concept learning through probabilistic program induction, 2015, Science.

[24] Sayan Mukherjee, et al. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, 2006, Adv. Comput. Math.

[25] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[26] Shie Mannor, et al. Robustness and Regularization of Support Vector Machines, 2008, J. Mach. Learn. Res.

[27] Ryota Tomioka, et al. Norm-Based Capacity Control in Neural Networks, 2015, COLT.

[28] Mikhail Belkin, et al. Diving into the shallows: a computational perspective on large-scale shallow learning, 2017, NIPS.

[29] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[30] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.

[31] T. Poggio, et al. Hierarchical models of object recognition in cortex, 1999, Nature Neuroscience.

[33] David M. Blei, et al. A Variational Analysis of Stochastic Gradient Algorithms, 2016, ICML.

[34] Shie Mannor, et al. Robustness and generalization, 2010, Machine Learning.

[35] D. Hubel, et al. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, 1962, The Journal of Physiology.

[37] Nathan Srebro, et al. Exploring Generalization in Deep Learning, 2017, NIPS.

[38] Ambuj Tewari, et al. Smoothness, Low Noise and Fast Rates, 2010, NIPS.

[39] Hrushikesh Narhar Mhaskar, et al. Approximation properties of a multilayered feedforward artificial neural network, 1993, Adv. Comput. Math.

[40] Matus Telgarsky, et al. Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[41] Ambuj Tewari, et al. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization, 2008, NIPS.

[42] Ruslan Salakhutdinov, et al. Geometry of Optimization and Implicit Regularization in Deep Learning, 2017, arXiv.

[43] Kurt Hornik, et al. Neural networks and principal component analysis: Learning from examples without local minima, 1989, Neural Networks.

[44] Kunihiko Fukushima, et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, 1980, Biological Cybernetics.

[45] Matus Telgarsky, et al. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis, 2017, COLT.

[46] Lorenzo Rosasco, et al. Learning with Incremental Iterative Regularization, 2014, NIPS.

[47] Gintare Karolina Dziugaite, et al. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, 2017, UAI.

[48] Peter L. Bartlett, et al. Neural Network Learning: Theoretical Foundations, 1999.

[49] Lorenzo Rosasco, et al. Deep Convolutional Networks are Hierarchical Kernel Machines, 2015, arXiv.

[50] Stefano Soatto, et al. Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.

[51] Shai Ben-David, et al. Understanding Machine Learning: From Theory to Algorithms, 2014.

[52] Sébastien Bubeck, et al. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.