On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

The stochastic gradient descent (SGD) method and its variants are the algorithms of choice for many deep learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that using a larger batch degrades the quality of the model, as measured by its ability to generalize. We investigate the cause of this generalization drop in the large-batch regime and present numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support the commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies that attempt to help large-batch methods eliminate this generalization gap.
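
For concreteness, the mini-batch gradient step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: `grad_fn` is an assumed, user-supplied function returning the gradient of the training loss on a batch, and the small- and large-batch regimes differ only in the value of `batch_size`.

```python
import numpy as np

def sgd_step(w, X, y, grad_fn, batch_size, lr=0.1, rng=None):
    """One SGD step on a random mini-batch of `batch_size` examples.

    `grad_fn(w, X_batch, y_batch)` is an assumed, user-supplied function
    returning the gradient of the training loss on the given batch.
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    g = grad_fn(w, X[idx], y[idx])   # noisy estimate of the full gradient
    return w - lr * g                # take a descent step

# Small-batch regime: batch_size of roughly 32-512 (noisier gradient estimates).
# Large-batch regime: batch_size equal to a large fraction of the training set
# (lower-variance estimates, which the paper links to sharp minimizers).
```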
