On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

The stochastic gradient descent (SGD) method and its variants are the algorithms of choice for many deep learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that using a larger batch degrades the quality of the model, as measured by its ability to generalize. We investigate the cause of this generalization drop in the large-batch regime and present numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support the commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies that attempt to help large-batch methods eliminate this generalization gap.
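
For concreteness, the mini-batch gradient step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: `grad_fn` is an assumed, user-supplied function returning the gradient of the training loss on a batch, and the small- and large-batch regimes differ only in the value of `batch_size`.

```python
import numpy as np

def sgd_step(w, X, y, grad_fn, batch_size, lr=0.1, rng=None):
    """One SGD step on a random mini-batch of `batch_size` examples.

    `grad_fn(w, X_batch, y_batch)` is an assumed, user-supplied function
    returning the gradient of the training loss on the given batch.
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    g = grad_fn(w, X[idx], y[idx])   # noisy estimate of the full gradient
    return w - lr * g                # take a descent step

# Small-batch regime: batch_size of roughly 32-512 (noisier gradient estimates).
# Large-batch regime: batch_size equal to a large fraction of the training set
# (lower-variance estimates, which the paper links to sharp minimizers).
```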
