Shuffling Gradient-Based Methods with Momentum

We combine two advanced ideas widely used in optimization for machine learning, the shuffling strategy and the momentum technique, to develop a novel shuffling gradient-based method with momentum for approximating a stationary point of non-convex finite-sum minimization problems. While our method is inspired by momentum techniques, its update differs significantly from existing momentum-based methods. We establish that our algorithm achieves a state-of-the-art convergence rate for both constant and diminishing learning rates under standard assumptions ($L$-smoothness and bounded variance). When the shuffling strategy is fixed, we develop another new algorithm that is similar to existing momentum methods and covers the single-shuffling and incremental gradient schemes as special cases. We prove the same convergence rate for this algorithm under the $L$-smoothness and bounded-gradient assumptions. We demonstrate our algorithms via numerical simulations on standard datasets and compare them with existing shuffling methods; our experiments show encouraging performance of the new algorithms.
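
To make the general idea concrete, the sketch below shows one epoch of a generic random-reshuffling gradient method combined with a heavy-ball-style momentum buffer. This is only an illustration of the shuffling-plus-momentum template, not the paper's actual update rule (the abstract notes that the proposed update differs significantly from existing momentum methods); all names (`grad_fn`, `n_samples`, `lr`, `beta`) are illustrative assumptions.

```python
# Minimal sketch, assuming a finite-sum objective F(w) = (1/n) * sum_i f_i(w).
# This is NOT the paper's exact method; it only illustrates the generic
# "shuffle each epoch + momentum-averaged gradient step" template.
import numpy as np

def shuffling_momentum_epoch(w, grad_fn, n_samples, lr=0.01, beta=0.9, v=None, rng=None):
    """Run one epoch: visit every component gradient once in a shuffled order.

    w         : parameter vector (np.ndarray)
    grad_fn   : grad_fn(w, i) returns the gradient of the i-th component f_i at w
    n_samples : number of components n in the finite sum
    v         : momentum buffer carried across epochs (zero-initialized if None)
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.zeros_like(w) if v is None else v

    permutation = rng.permutation(n_samples)   # fresh shuffle each epoch (random reshuffling)
    for i in permutation:
        g = grad_fn(w, i)                      # single-component gradient
        v = beta * v + (1.0 - beta) * g        # heavy-ball-style momentum average
        w = w - lr * v                         # inner step along the momentum direction
    return w, v
```

Fixing `permutation` across epochs instead of redrawing it yields the single-shuffling and incremental-gradient schemes mentioned in the abstract as special cases of the fixed-shuffling algorithm.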
