Learning-Rate Annealing Methods for Deep Neural Networks

Deep neural networks (DNNs) have achieved great success over the last decades. DNNs are typically optimized with stochastic gradient descent (SGD) under a learning-rate annealing schedule, which outperforms adaptive methods on many tasks. However, there is no commonly accepted choice of annealing schedule for SGD. This paper presents an empirical analysis of learning-rate annealing methods, based on experiments with the major image-classification datasets; image classification is one of the key applications of DNNs. Our experiments combine recent deep neural network models with a variety of learning-rate annealing methods. We also propose a sigmoid-based annealing schedule with warmup, which is shown to outperform both the adaptive methods and the other existing schedules in accuracy in most cases.
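
The abstract does not give the exact functional form of the proposed schedule, so the following is only a minimal sketch of one plausible formulation: a linear warmup followed by a sigmoid-shaped decay from a base learning rate to a final learning rate. All parameter names (base_lr, final_lr, warmup_steps, steepness) are illustrative assumptions, not the paper's definitions.

```python
import math

def sigmoid_warmup_lr(step, total_steps, base_lr=0.1, final_lr=0.001,
                      warmup_steps=500, steepness=10.0):
    """Sketch of a sigmoid annealing schedule with warmup.

    Linear warmup from 0 to base_lr over warmup_steps, then a logistic
    (sigmoid) decay from base_lr toward final_lr over the remaining steps.
    The exact form used in the paper is not specified in the abstract;
    this formulation is an assumption for illustration.
    """
    if step < warmup_steps:
        # Linear warmup: ramp the learning rate up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Progress through the decay phase, mapped to [0, 1].
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Logistic decay centered at the midpoint of the decay phase.
    decay = 1.0 / (1.0 + math.exp(steepness * (t - 0.5)))
    return final_lr + (base_lr - final_lr) * decay

# Example usage (PyTorch-style loop, optimizer assumed to exist):
# for step in range(total_steps):
#     for group in optimizer.param_groups:
#         group["lr"] = sigmoid_warmup_lr(step, total_steps)
```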
