Improved Analysis and Rates for Variance Reduction under Without-replacement Sampling Orders

When applying a stochastic algorithm, one must choose an order in which to draw samples. The practical choices are without-replacement sampling orders, which are empirically faster and more cache-friendly than uniform i.i.d. sampling but often come with inferior theoretical guarantees. Without-replacement sampling is well understood only for SGD without variance reduction. In this paper, we improve the convergence analysis and rates of variance reduction under without-replacement sampling orders for composite finite-sum minimization. Our results are twofold. First, we develop a damped variant of Finito called Prox-DFinito and establish its convergence rates under random reshuffling, cyclic sampling, and shuffling-once, in both the convex and the strongly convex settings. These rates match those of full-batch gradient descent and are state-of-the-art among existing results for without-replacement sampling with variance reduction. Second, our analysis gauges how the cyclic order influences the rate of cyclic sampling and thus allows us to derive the optimal fixed ordering. In the highly data-heterogeneous scenario, Prox-DFinito with optimal cyclic sampling attains a sample-size-independent convergence rate, which is, to our knowledge, the first without-replacement result that matches uniform i.i.d. sampling with variance reduction. We also propose a practical method to discover the optimal cyclic ordering numerically.
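Since the abstract centers on how without-replacement orders (cyclic sampling, shuffling-once, random reshuffling) interact with a Finito-style variance-reduced update, the sketch below illustrates those three orders on a small ridge-regression finite sum. It is a minimal illustration under stated assumptions, not the paper's Prox-DFinito: the damping parameter and the proximal handling of the composite term are omitted, and the problem data, step size, and helper names (`make_order`, `finito_style`, `grad_i`) are assumptions introduced for this example.

```python
# Minimal sketch (not the paper's Prox-DFinito): a Finito-style variance-reduced
# update on a smooth finite sum, run under the three without-replacement
# sampling orders discussed in the abstract.
import numpy as np

def make_order(n, scheme, rng, perm=None):
    """Index order for one epoch under a without-replacement scheme."""
    if scheme == "cyclic":            # fixed order 0, 1, ..., n-1 every epoch
        return np.arange(n)
    if scheme == "shuffle_once":      # one permutation drawn up front, reused
        return perm
    if scheme == "random_reshuffle":  # fresh permutation each epoch
        return rng.permutation(n)
    raise ValueError(f"unknown scheme: {scheme}")

def grad_i(A, b, lam, x, i):
    """Gradient of the i-th component f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*lam*||x||^2."""
    return (A[i] @ x - b[i]) * A[i] + lam * x

def finito_style(A, b, lam, alpha, scheme, epochs=50, seed=0):
    """Finito-style iteration (anchor points + gradient table) under the given order."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    phi = np.zeros((n, d))                       # per-sample anchor points
    grads = np.array([grad_i(A, b, lam, phi[i], i) for i in range(n)])
    perm = rng.permutation(n)                    # used only by "shuffle_once"
    for _ in range(epochs):
        for i in make_order(n, scheme, rng, perm):
            x = phi.mean(axis=0) - alpha * grads.mean(axis=0)  # Finito-style step
            phi[i] = x                                         # refresh anchor i
            grads[i] = grad_i(A, b, lam, x, i)                 # refresh stored gradient i
    return phi.mean(axis=0) - alpha * grads.mean(axis=0)

# Example: compare the three orders on a synthetic ridge-regression problem.
rng = np.random.default_rng(1)
n, d, lam = 100, 10, 0.1
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)  # closed form
for scheme in ("cyclic", "shuffle_once", "random_reshuffle"):
    x = finito_style(A, b, lam=lam, alpha=0.05, scheme=scheme)
    print(f"{scheme:17s} distance to optimum: {np.linalg.norm(x - x_star):.2e}")
```

The only difference between the three runs is the epoch ordering returned by `make_order`; the variance-reduced update itself is unchanged, which is the comparison the paper's analysis formalizes.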
