A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $\tau$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: in fact, the error can be as bad as that of non-delayed gradient descent run on only $1/\tau$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which showed this phenomenon only asymptotically or for much smaller delays. Moreover, in the context of distributed optimization, our results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing the convergence of optimization algorithms using generating functions.
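To make the setting concrete, here is a minimal sketch of delayed (stochastic) gradient descent on a quadratic $f(w) = \tfrac{1}{2} w^\top A w - b^\top w$, where each update uses the gradient evaluated at the iterate from $\tau$ rounds earlier. The function name, the handling of the first $\tau$ rounds (the stale iterate is taken to be $w_0$), and the Gaussian noise model are illustrative assumptions, not details fixed by the abstract.

```python
import numpy as np

def delayed_sgd_quadratic(A, b, tau, eta, T, noise_std=0.0, seed=0):
    """Run T delayed (stochastic) gradient steps on f(w) = 0.5 w^T A w - b^T w.

    The update is w_{t+1} = w_t - eta * g(w_{t-tau}), where g is the gradient
    A w - b plus optional Gaussian noise (the noise model is an assumption).
    """
    rng = np.random.default_rng(seed)
    d = b.shape[0]
    iterates = [np.zeros(d)]  # w_0, w_1, ..., w_t
    for t in range(T):
        # Gradient is computed at the stale iterate w_{max(t - tau, 0)}.
        w_stale = iterates[max(t - tau, 0)]
        grad = A @ w_stale - b + noise_std * rng.standard_normal(d)
        iterates.append(iterates[-1] - eta * grad)
    return iterates[-1]
```

With noise_std = 0 this reduces to deterministic delayed gradient descent, the regime in which the abstract states the error can be as bad as non-delayed gradient descent run on only a $1/\tau$ fraction of the gradients; with noise it corresponds to the stochastic regime where the effect of the delay is shown to be negligible.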
