RMSprop converges with proper hyper-parameter

Despite the existence of divergence examples, RMSprop remains one of the most popular algorithms in machine learning. Towards closing the gap between theory and practice, we prove that RMSprop converges under a proper choice of hyperparameters and certain conditions. More specifically, we prove that when the hyperparameter β2 is close enough to 1, RMSprop and its random-shuffling version converge to a bounded region in general, and to critical points in the interpolation regime. It is worth mentioning that our results do not depend on the "bounded gradient" assumption, which is often the key assumption used in existing theoretical work on Adam-type adaptive gradient methods. Removing this assumption allows us to establish a phase transition from divergence to non-divergence for RMSprop. Finally, based on our theory, we conjecture that in practice there is a critical threshold β2*, such that RMSprop generates reasonably good results only if 1 > β2 ≥ β2*. We provide empirical evidence for such a phase transition in our numerical experiments.
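
For concreteness, below is a minimal sketch of the standard RMSprop update that the analysis concerns. The quadratic objective, step size lr, and stabilizer eps are illustrative placeholders, not values from the paper; setting beta2 close to 1 corresponds to the non-divergence regime discussed above.

```python
import numpy as np

def rmsprop_step(x, v, grad, lr=1e-3, beta2=0.999, eps=1e-8):
    """One RMSprop update: v is an exponential moving average of squared
    gradients, and the step scales the gradient by 1/sqrt(v)."""
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    x = x - lr * grad / (np.sqrt(v) + eps)
    return x, v

# Toy example: minimize f(x) = 0.5 * ||x||^2 with noisy gradients.
rng = np.random.default_rng(0)
x, v = np.ones(10), np.zeros(10)
for _ in range(5000):
    grad = x + 0.1 * rng.standard_normal(10)      # stochastic gradient of f
    x, v = rmsprop_step(x, v, grad, beta2=0.999)  # beta2 chosen near 1
print(np.linalg.norm(x))  # small norm: iterates settle near the critical point
```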
