Variance-Reduced Stochastic Learning Under Random Reshuffling

Several useful variance-reduced stochastic gradient algorithms, such as SVRG, SAGA, Finito, and SAG, have been proposed to minimize empirical risks, with linear convergence to the exact minimizer. The existing convergence results assume uniform data sampling with replacement. However, it has been observed in related works that random reshuffling can deliver superior performance over uniform sampling, yet no formal proofs or guarantees of exact convergence exist for variance-reduced algorithms under random reshuffling. This paper makes two contributions. First, it provides a theoretical guarantee of linear convergence, in the mean-square sense, for SAGA under random reshuffling; the argument is also adaptable to other variance-reduced algorithms. Second, under random reshuffling, the paper proposes a new amortized variance-reduced gradient (AVRG) algorithm with constant storage requirements in comparison to SAGA and with balanced gradient computations in comparison to SVRG. AVRG is also shown analytically to converge linearly.
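
To make the setting concrete, the following is a minimal Python sketch of the standard SAGA recursion in which uniform sampling with replacement is replaced by a fresh random permutation of the data in each epoch. The function name saga_random_reshuffling, the grad_fn interface, and the least-squares usage below are illustrative choices, not the paper's code; the SAGA update itself (new gradient minus stored gradient plus table average, with an O(N) gradient table) follows the well-known formulation.

```python
import numpy as np

def saga_random_reshuffling(grad_fn, w0, n_samples, step_size, n_epochs, seed=None):
    """SAGA with random reshuffling on a finite-sum objective.

    grad_fn(i, w) returns the gradient of the i-th component loss at w.
    A table of the most recent per-sample gradients is kept (O(N x d) memory),
    and in each epoch the samples are visited according to a new random
    permutation instead of being drawn uniformly with replacement.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    # Initialize the gradient table and its running average at w0.
    table = np.array([grad_fn(i, w) for i in range(n_samples)])
    avg = table.mean(axis=0)
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):      # random reshuffling
            g = grad_fn(i, w)
            # SAGA variance-reduced direction: new grad - stored grad + table average.
            w -= step_size * (g - table[i] + avg)
            # Refresh the running average and the stored gradient for sample i.
            avg += (g - table[i]) / n_samples
            table[i] = g
    return w
```

For example, for a least-squares problem with data matrix A and targets b, one could pass grad_fn = lambda i, w: (A[i] @ w - b[i]) * A[i]. The AVRG algorithm proposed in the paper differs from this sketch by removing the gradient table and amortizing the full-gradient computation over each epoch, which is what yields constant storage and balanced per-iteration gradient costs.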
