SGDR: Stochastic Gradient Descent with Warm Restarts

Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence of accelerated gradient schemes on ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results with test errors of 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at this https URL.
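
The core of the method is a cyclical learning-rate schedule: within each restart period the learning rate is annealed from an upper bound down to a lower bound along a cosine curve, and at the start of the next period it is reset (warm-restarted) to the upper bound, with each period optionally lengthened by a constant factor. Below is a minimal Python sketch of such a schedule; the function name and the hyperparameter values (eta_min, eta_max, T_0, T_mult) are illustrative, not taken from the paper.

import math

def sgdr_learning_rate(epoch, eta_min=0.0, eta_max=0.1, T_0=10, T_mult=2):
    """Cosine-annealed learning rate with warm restarts (illustrative sketch).

    eta_min, eta_max: lower and upper bounds of the learning rate.
    T_0: length (in epochs) of the first restart period.
    T_mult: factor by which each successive period grows.
    """
    # Find the current restart period and the position within it.
    T_i, t = T_0, epoch
    while t >= T_i:
        t -= T_i
        T_i *= T_mult
    # Cosine annealing from eta_max (start of period) to eta_min (end of period).
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))

# Example: print the schedule for the first few epochs.
if __name__ == "__main__":
    for epoch in range(30):
        print(epoch, round(sgdr_learning_rate(epoch), 4))

In practice, such a schedule would set the optimizer's learning rate at every epoch (or every batch); PyTorch, for instance, ships a CosineAnnealingWarmRestarts scheduler built on the same idea.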
