Improved Analysis and Rates for Variance Reduction under Without-replacement Sampling Orders

When applying a stochastic algorithm, one must choose an order in which to draw samples. The practical choices are without-replacement sampling orders, which are empirically faster and more cache-friendly than uniform i.i.d. sampling but often come with inferior theoretical guarantees. Without-replacement sampling is well understood only for SGD without variance reduction. In this paper, we improve the convergence analysis and rates of variance reduction under without-replacement sampling orders for composite finite-sum minimization. Our results are twofold. First, we develop a damped variant of Finito called Prox-DFinito and establish its convergence rates under random reshuffling, cyclic sampling, and shuffling-once, in both the convex and the strongly convex settings. These rates match those of full-batch gradient descent and are state-of-the-art among existing results for without-replacement sampling with variance reduction. Second, our analysis gauges how the cyclic order influences the rate of cyclic sampling and thus allows us to derive the optimal fixed ordering. In the highly data-heterogeneous scenario, Prox-DFinito with optimal cyclic sampling attains a sample-size-independent convergence rate, which is, to our knowledge, the first without-replacement result that matches uniform i.i.d. sampling with variance reduction. We also propose a practical method to discover the optimal cyclic ordering numerically.
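Since the abstract centers on how without-replacement orders (cyclic sampling, shuffling-once, random reshuffling) interact with a Finito-style variance-reduced update, the sketch below illustrates those three orders on a small ridge-regression finite sum. It is a minimal illustration under stated assumptions, not the paper's Prox-DFinito: the damping parameter and the proximal handling of the composite term are omitted, and the problem data, step size, and helper names (`make_order`, `finito_style`, `grad_i`) are assumptions introduced for this example.

```python
# Minimal sketch (not the paper's Prox-DFinito): a Finito-style variance-reduced
# update on a smooth finite sum, run under the three without-replacement
# sampling orders discussed in the abstract.
import numpy as np

def make_order(n, scheme, rng, perm=None):
    """Index order for one epoch under a without-replacement scheme."""
    if scheme == "cyclic":            # fixed order 0, 1, ..., n-1 every epoch
        return np.arange(n)
    if scheme == "shuffle_once":      # one permutation drawn up front, reused
        return perm
    if scheme == "random_reshuffle":  # fresh permutation each epoch
        return rng.permutation(n)
    raise ValueError(f"unknown scheme: {scheme}")

def grad_i(A, b, lam, x, i):
    """Gradient of the i-th component f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*lam*||x||^2."""
    return (A[i] @ x - b[i]) * A[i] + lam * x

def finito_style(A, b, lam, alpha, scheme, epochs=50, seed=0):
    """Finito-style iteration (anchor points + gradient table) under the given order."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    phi = np.zeros((n, d))                       # per-sample anchor points
    grads = np.array([grad_i(A, b, lam, phi[i], i) for i in range(n)])
    perm = rng.permutation(n)                    # used only by "shuffle_once"
    for _ in range(epochs):
        for i in make_order(n, scheme, rng, perm):
            x = phi.mean(axis=0) - alpha * grads.mean(axis=0)  # Finito-style step
            phi[i] = x                                         # refresh anchor i
            grads[i] = grad_i(A, b, lam, x, i)                 # refresh stored gradient i
    return phi.mean(axis=0) - alpha * grads.mean(axis=0)

# Example: compare the three orders on a synthetic ridge-regression problem.
rng = np.random.default_rng(1)
n, d, lam = 100, 10, 0.1
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)  # closed form
for scheme in ("cyclic", "shuffle_once", "random_reshuffle"):
    x = finito_style(A, b, lam=lam, alpha=0.05, scheme=scheme)
    print(f"{scheme:17s} distance to optimum: {np.linalg.norm(x - x_star):.2e}")
```

The only difference between the three runs is the epoch ordering returned by `make_order`; the variance-reduced update itself is unchanged, which is the comparison the paper's analysis formalizes.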
