Without-Replacement Sampling for Stochastic Gradient Methods

Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled *with* replacement. In contrast, sampling *without* replacement is far less understood, yet in practice it is very common, often easier to implement, and usually performs better. In this paper, we provide competitive convergence guarantees for without-replacement sampling in several scenarios, focusing on the natural regime of few passes over the data. Moreover, we describe a useful application of these results in the context of distributed optimization with randomly-partitioned data, yielding a nearly-optimal algorithm for regularized least squares (in terms of both communication complexity and runtime complexity) under broad parameter regimes. Our proof techniques combine ideas from stochastic optimization, adversarial online learning, and transductive learning theory, and can potentially be applied to other stochastic optimization and learning problems.
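
To make the distinction concrete, below is a minimal sketch (not the paper's algorithm or analysis) of stochastic gradient descent on ridge-regularized least squares under both sampling schemes. The function name `sgd_least_squares`, the step size, the regularization parameter, and the synthetic data are all illustrative assumptions; the only substantive difference between the two regimes is how the index sequence is drawn each epoch.

```python
import numpy as np

def sgd_least_squares(X, y, lam=0.1, eta=0.01, epochs=5,
                      with_replacement=True, seed=0):
    """SGD on f(w) = (1/2n) * ||Xw - y||^2 + (lam/2) * ||w||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        if with_replacement:
            # I.i.d. sampling: each step draws an index uniformly at random,
            # so some points may repeat within a pass and others be skipped.
            order = rng.integers(0, n, size=n)
        else:
            # Without replacement (random reshuffling): each epoch visits
            # every data point exactly once, in a fresh random order.
            order = rng.permutation(n)
        for i in order:
            # Gradient of the single-example regularized loss at index i.
            grad = (X[i] @ w - y[i]) * X[i] + lam * w
            w -= eta * grad
    return w

# Illustrative usage on synthetic data: both variants make the same number
# of gradient steps per pass; only the sampling of indices differs.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(500)
w_wr = sgd_least_squares(X, y, with_replacement=True)
w_wor = sgd_least_squares(X, y, with_replacement=False)
```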
