Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers whose computation and communication frequencies vary over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{\max}\epsilon^{-1}\right)$ iterations, where $\sigma^2$ denotes the variance of the stochastic gradients. In this work, (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{\mathrm{avg}}}\,\epsilon^{-1}\right)$ without any change to the algorithm, where $\tau_{\mathrm{avg}}$ is the average delay, which can be significantly smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{\mathrm{avg}}\epsilon^{-1}\right)$ and requires neither extra hyperparameter tuning nor extra communication. Our result allows us to show, for the first time, that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is affected only by the average delay within each worker.
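To make the setting concrete, below is a minimal single-process sketch of asynchronous SGD with a delay-adaptive step size. It assumes a toy quadratic objective, a simplified worker model in which one random worker returns its gradient per iteration, and an illustrative rule $\eta_t = \min(\eta_{\max}, c/\tau_t)$ with a placeholder constant $c$; none of these choices is the exact scheme analyzed in the paper. The sketch only illustrates how the delay $\tau_t$ of each incoming gradient can scale the step size without any extra communication.

```python
# Minimal single-process simulation of asynchronous SGD with a delay-adaptive
# step size. Illustrative sketch only: the objective, the delay pattern, and
# the rule eta_t = min(eta_max, c / tau_t) are assumptions, not the paper's
# exact algorithm.
import numpy as np

rng = np.random.default_rng(0)

d = 10                      # model dimension
x = rng.normal(size=d)      # server model
x_star = np.zeros(d)        # minimizer of the toy quadratic f(x) = 0.5 * ||x||^2
sigma = 0.1                 # stochastic-gradient noise level
eta_max = 0.5               # base step size
c = 1.0                     # delay-adaptive constant (placeholder)

# Each in-flight gradient is stored with the iteration at which its model
# snapshot was taken, so its delay tau_t = t - t_computed is known on arrival.
n_workers = 8
inflight = [(0, x.copy()) for _ in range(n_workers)]  # (time computed, snapshot)

T = 2000
for t in range(T):
    # A random worker finishes and returns a stochastic gradient evaluated at
    # the (possibly stale) model snapshot it received earlier.
    w = rng.integers(n_workers)
    t_computed, x_old = inflight[w]
    grad = x_old + sigma * rng.normal(size=d)   # noisy gradient of 0.5*||x||^2

    tau_t = t - t_computed                      # delay of this gradient
    eta_t = min(eta_max, c / max(tau_t, 1))     # delay-adaptive step size (illustrative)

    x -= eta_t * grad                           # server update with the delayed gradient

    # The worker immediately grabs the current model and starts a new computation.
    inflight[w] = (t, x.copy())

print(f"final squared distance to optimum: {np.sum((x - x_star) ** 2):.2e}")
```

In this toy setting, capping the step size by the inverse of the delay keeps very stale gradients from dominating the update, which loosely mirrors the intuition behind replacing the $\tau_{\max}$ dependence with a $\tau_{\mathrm{avg}}$ dependence in the rate.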
