Adathm: Adaptive Gradient Method Based on Estimates of Third-Order Moments

Deep learning has been widely used in the field of data aggregation and fusion. As a significant aspect of deep learning, the stochastic optimization algorithm affects both training efficiency and final performance. Adaptive optimization methods such as Adagrad, RMSprop, and Adam have been proposed to speed up training by applying an element-wise scaling term to the learning rate. Nevertheless, unstable and extreme learning rates can prevent these methods from converging to an optimal solution (or to a critical point in nonconvex settings). To reduce the impact of unsuitable learning rates, we propose Adathm, a new method that incorporates estimates of third-order moments into Adam. We also introduce the idea of dynamic bounds on learning rates and endow the proposed method with a "long-term memory" of past gradients. Preliminary experimental results show that the proposed algorithm fixes these convergence issues and compares favorably with other stochastic optimization methods in several real applications.
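The abstract names the ingredients of the method (Adam-style first- and second-moment estimates, an additional third-order moment estimate, and dynamic bounds on the learning rate) without giving the update rule. The sketch below is only one plausible way to combine those ingredients in NumPy: the function name adathm_like_step, the beta3/lower/upper parameters, and the way the third moment enters the denominator are illustrative assumptions, not the Adathm update defined in the paper.

    # Hypothetical sketch: an Adam-style step with an extra third-moment EMA
    # and a clipped (dynamically bounded) per-parameter step size.
    import numpy as np

    def adathm_like_step(param, grad, state, lr=1e-2,
                         beta1=0.9, beta2=0.999, beta3=0.999, eps=1e-8,
                         lower=1e-4, upper=1e-1):
        """Apply one update to a NumPy parameter array (illustration only)."""
        t = state["t"] = state.get("t", 0) + 1
        m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad               # 1st moment
        v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad ** 2          # 2nd moment
        u = state["u"] = beta3 * state.get("u", 0.0) + (1 - beta3) * np.abs(grad) ** 3  # 3rd absolute moment
        m_hat = m / (1 - beta1 ** t)   # bias correction, as in Adam
        v_hat = v / (1 - beta2 ** t)
        u_hat = u / (1 - beta3 ** t)
        # Scale by both second- and third-moment statistics, then clip the
        # per-parameter step size into [lower, upper] so extreme learning
        # rates cannot occur (the "dynamic bound" idea).
        denom = np.sqrt(v_hat) + np.cbrt(u_hat) + eps
        step_size = np.clip(lr / denom, lower, upper)
        return param - step_size * m_hat

    # Toy usage: minimize f(x) = ||x||^2; the loss should shrink over the run.
    state, x = {}, np.random.randn(5)
    for _ in range(500):
        x = adathm_like_step(x, 2.0 * x, state)
    print("final loss:", float(np.sum(x ** 2)))

Clipping the step size into a fixed interval follows the dynamic-bound idea popularized by AdaBound; how Adathm actually folds the third-moment estimate into the Adam denominator is specified in the paper itself, not in this sketch.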
