Permutation-Based SGD: Is Random Optimal?

A recent line of ground-breaking results for permutation-based SGD has corroborated a widely observed phenomenon: random permutations offer faster convergence than with-replacement sampling. However, is random optimal? We show that the answer depends heavily on which functions we are optimizing, and that the convergence gap between optimal and random permutations can vary from exponential to nonexistent. We first show that for one-dimensional strongly convex functions with smooth second derivatives, there exist permutations that offer exponentially faster convergence than random. However, for general strongly convex functions, random permutations are optimal. Finally, we show that for quadratic, strongly convex functions, there are easy-to-construct permutations that lead to accelerated convergence compared to random. Our results suggest that a general convergence characterization of optimal permutations cannot capture the nuances of individual function classes, and can mistakenly indicate that one cannot do much better than random.
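
To make the setting concrete, below is a minimal sketch (in Python/NumPy) of permutation-based SGD on a toy one-dimensional quadratic finite sum, comparing random reshuffling (a fresh permutation every epoch) against a single fixed permutation reused across epochs. The problem instance, step size, and permutation choices are illustrative assumptions for this example only; they are not the optimal or accelerated constructions studied in the paper.

```python
# Minimal sketch (illustrative, not the paper's constructions):
# permutation-based SGD on the toy finite sum
#     f(x) = (1/n) * sum_i 0.5 * (a_i * x - b_i)^2.
import numpy as np

rng = np.random.default_rng(0)
n, epochs, lr = 32, 50, 0.05                 # assumed problem size and step size
a = rng.uniform(0.5, 2.0, size=n)            # per-component curvatures
b = rng.normal(size=n)                       # per-component offsets
x_star = np.sum(a * b) / np.sum(a**2)        # exact minimizer of the average

def grad(i, x):
    """Gradient of the i-th component f_i(x) = 0.5 * (a_i * x - b_i)^2."""
    return a[i] * (a[i] * x - b[i])

def run(permutation_rule, x0=0.0):
    """Run epochs of incremental SGD; permutation_rule(epoch) gives the visiting order."""
    x = x0
    for epoch in range(epochs):
        for i in permutation_rule(epoch):
            x -= lr * grad(i, x)
    return abs(x - x_star)

# Random reshuffling: draw a fresh permutation every epoch.
rr_error = run(lambda _: rng.permutation(n))

# One arbitrary fixed permutation reused every epoch, for comparison.
fixed = rng.permutation(n)
fixed_error = run(lambda _: fixed)

print(f"random reshuffling error: {rr_error:.3e}")
print(f"fixed permutation error:  {fixed_error:.3e}")
```

Experimenting with other orderings only requires swapping in a different `permutation_rule`, e.g. a hand-crafted permutation schedule in place of the random or fixed choices above.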
