Sharp Minima Can Generalize For Deep Nets

Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case remains an open area of research. One hypothesis that has been gaining popularity (e.g., Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) is that the flatness of the minima of the loss function found by stochastic gradient-based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we are allowed to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.
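To make the symmetry argument concrete, the following sketch (illustrative only and not taken from the paper; the toy two-layer network, the squared loss, the perturbation-based sharpness proxy, and the scaling factor `alpha` are all assumptions of this example) uses the non-negative homogeneity of rectifiers, relu(a·z) = a·relu(z) for a > 0: multiplying the first layer's weights by alpha and dividing the second layer's weights by alpha leaves the network's predictions, and hence its generalization, unchanged, while the local geometry of the loss around those parameters changes dramatically.

```python
# Minimal sketch, assuming a 2-layer ReLU net y = relu(X @ W1) @ W2 with squared loss
# on random data; W1, W2, alpha, and the sharpness proxy are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))    # inputs
y = rng.normal(size=(64, 1))     # targets
W1 = rng.normal(size=(10, 20))   # first-layer weights
W2 = rng.normal(size=(20, 1))    # second-layer weights

def forward(W1, W2, X):
    return np.maximum(X @ W1, 0.0) @ W2     # rectifier network

def loss(W1, W2):
    return 0.5 * np.mean((forward(W1, W2, X) - y) ** 2)

def sharpness_proxy(W1, W2, eps=1e-3, trials=100):
    """Largest loss increase over random parameter perturbations of fixed norm eps."""
    base, worst = loss(W1, W2), 0.0
    for _ in range(trials):
        d1 = rng.normal(size=W1.shape)
        d2 = rng.normal(size=W2.shape)
        scale = eps / np.sqrt((d1 ** 2).sum() + (d2 ** 2).sum())
        worst = max(worst, loss(W1 + scale * d1, W2 + scale * d2) - base)
    return worst

alpha = 100.0                      # relu(alpha * z) = alpha * relu(z) for alpha > 0
W1s, W2s = alpha * W1, W2 / alpha  # equivalent parameters realizing the same function

print(np.allclose(forward(W1, W2, X), forward(W1s, W2s, X)))  # True: identical predictions
print(sharpness_proxy(W1, W2), sharpness_proxy(W1s, W2s))     # proxy typically far larger for the rescaled weights
```

The same rescaling can be applied at a minimum found by training, so any flatness measure based on Hessian eigenvalues or on the loss within a fixed-radius neighborhood can be made to take very different values for parameter settings that represent exactly the same predictor.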

[1] Geoffrey E. Hinton, et al. Keeping the neural networks simple by minimizing the description length of the weights, 1993, COLT '93.

[2] Jürgen Schmidhuber, et al. Flat Minima, 1997, Neural Computation.

[3] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[4] Aapo Hyvärinen, et al. Nonlinear independent component analysis: Existence and uniqueness results, 1999, Neural Networks.

[5] A. Klyachko. Random walks on symmetric spaces and inequalities for matrix spectra, 2000.

[6] André Elisseeff, et al. Stability and Generalization, 2002, J. Mach. Learn. Res.

[7] Amos Storkey, et al. Advances in Neural Information Processing Systems 20, 2007.

[8] Léon Bottou, et al. On-line learning for very large data sets, 2005.

[10] Léon Bottou, et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.

[11] Yurii Nesterov, et al. Confidence level solutions for stochastic programming, 2000, Autom.

[12] Yann LeCun, et al. What is the best multi-stage architecture for object recognition?, 2009, IEEE International Conference on Computer Vision (ICCV).

[13] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[14] Geoffrey E. Hinton, et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[15] Léon Bottou, et al. Large-Scale Machine Learning with Stochastic Gradient Descent, 2010, COMPSTAT.

[16] Yoshua Bengio, et al. Deep Sparse Rectifier Neural Networks, 2011, AISTATS.

[17] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[18] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Yoshua Bengio, et al. Maxout Networks, 2013, ICML.

[20] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[21] Erich Elsen, et al. Deep Speech: Scaling up end-to-end speech recognition, 2014, ArXiv.

[22] Surya Ganguli, et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, 2014, NIPS.

[23] Surya Ganguli, et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013, ICLR.

[24] Razvan Pascanu, et al. On the Number of Linear Regions of Deep Neural Networks, 2014, NIPS.

[25] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[26] Joan Bruna, et al. Intriguing properties of neural networks, 2013, ICLR.

[27] Razvan Pascanu, et al. Revisiting Natural Gradient for Deep Networks, 2013, ICLR.

[28] Shakir Mohamed, et al. Variational Inference with Normalizing Flows, 2015, ICML.

[29] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[30] Yann LeCun, et al. Explorations on high dimensional landscapes, 2014, ICLR.

[31] Roberto Cipolla, et al. Understanding symmetries in deep networks, 2015, ArXiv.

[32] Ruslan Salakhutdinov, et al. Path-SGD: Path-Normalized Optimization in Deep Neural Networks, 2015, NIPS.

[33] Razvan Pascanu, et al. Natural Neural Networks, 2015, NIPS.

[34] Yann LeCun, et al. The Loss Surfaces of Multilayer Networks, 2014, AISTATS.

[35] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.

[36] Jonathon Shlens, et al. Explaining and Harnessing Adversarial Examples, 2014, ICLR.

[37] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[38] Yoshua Bengio, et al. NICE: Non-linear Independent Components Estimation, 2014, ICLR.

[39] Dumitru Erhan, et al. Going deeper with convolutions, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[41] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[42] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[43] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Quoc V. Le, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[45] Daniel Jiwoong Im, et al. An empirical analysis of the optimization of deep network loss surfaces, 2016, ArXiv:1612.04010.

[46] Matthias Bethge, et al. A note on the evaluation of generative models, 2015, ICLR.

[47] Tim Salimans, et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.

[48] Wojciech Zaremba, et al. Improved Techniques for Training GANs, 2016, NIPS.

[49] Yann LeCun, et al. Singularity of the Hessian in Deep Learning, 2016, ArXiv.

[50] Razvan Pascanu, et al. Local minima in training of deep networks, 2017, ArXiv.

[51] Yann LeCun, et al. Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond, 2016, ArXiv:1611.07476.

[52] Daniel Jiwoong Im, et al. An Empirical Analysis of Deep Network Loss Surfaces, 2016, ArXiv.

[53] Gabriel Synnaeve, et al. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System, 2016, ArXiv.

[54] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[55] Ole Winther, et al. Autoencoding beyond pixels using a learned similarity metric, 2015, ICML.

[56] Venu Govindaraju, et al. Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks, 2016, ICML.

[57] Surya Ganguli, et al. On the Expressive Power of Deep Neural Networks, 2016, ICML.

[58] Samy Bengio, et al. Density estimation using Real NVP, 2016, ICLR.

[59] Shai Shalev-Shwartz, et al. Fast Rates for Empirical Risk Minimization of Strict Saddle Problems, 2017, COLT.

[60] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[61] Max Welling, et al. Improved Variational Inference with Inverse Autoregressive Flow, 2016, NIPS.

[62] Yoshua Bengio, et al. Sharp Minima Can Generalize For Deep Nets: Supplementary Material, 2017.

[63] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[64] Stefano Soatto, et al. Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.

[65] Yann Dauphin, et al. A Convolutional Encoder Model for Neural Machine Translation, 2016, ACL.

[66] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Rev.