A Decreasing Scaling Transition Scheme from Adam to SGD

The adaptive gradient algorithm (AdaGrad) and its variants, such as RMSProp, Adam, and AMSGrad, are widely used in deep learning. Although these algorithms converge faster in the early phase of training, their generalization performance is often not as good as that of stochastic gradient descent (SGD). Hence, a trade-off method that switches from Adam to SGD after a certain number of iterations, so as to combine the merits of both algorithms, is of theoretical and practical significance. To that end, we propose a decreasing scaling transition scheme, called DSTAdam, that achieves a smooth and stable transition from Adam to SGD. We also prove the convergence of DSTAdam in the online convex setting. Finally, the effectiveness of DSTAdam is verified on the CIFAR-10/100 datasets. Our implementation is available at: https://github.com/kunzeng/DSTAdam.
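The exact DSTAdam update rule is given in the paper and the linked repository; the sketch below is only a rough illustration of the general idea of a decreasing scaling transition, assuming the update blends a standard Adam step with a plain SGD step through a factor that decays from 1 toward 0 over training. The function name `dst_like_step`, the linear decay schedule, and all hyperparameter defaults are placeholders for illustration, not the scheme used in DSTAdam.

```python
# Illustrative sketch only: blends an Adam-style step with an SGD step using a
# scaling factor rho that decreases over training (hypothetical schedule).
import numpy as np

def dst_like_step(w, g, m, v, t, adam_lr=1e-3, sgd_lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8, total_steps=10000):
    """One hypothetical transition step for parameters w with gradient g at step t (t >= 1)."""
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_step = adam_lr * m_hat / (np.sqrt(v_hat) + eps)

    # Plain SGD step on the raw gradient.
    sgd_step = sgd_lr * g

    # Decreasing scaling factor: near 1 early (mostly Adam), decaying toward 0 (mostly SGD).
    rho = max(0.0, 1.0 - t / total_steps)

    # Convex combination of the two update directions.
    w = w - (rho * adam_step + (1.0 - rho) * sgd_step)
    return w, m, v
```

In this toy form, training starts with the adaptive (Adam-like) direction dominating and gradually hands control to the SGD direction, which is the qualitative behavior the abstract describes; the actual scaling function and its analysis are those of the paper.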
