Learning Under Delayed Feedback: Implicitly Adapting to Gradient Delays

We consider stochastic convex optimization problems in which several machines act asynchronously in parallel while sharing a common memory. We propose a robust training method for the constrained setting and derive non-asymptotic convergence guarantees that do not depend on prior knowledge of the update delays, the objective's smoothness, or the gradient variance. In contrast, existing methods for this setting crucially rely on such prior knowledge, which renders them unsuitable for essentially all shared-resource computing environments, such as clouds and data centers. Concretely, existing approaches cannot accommodate changes in the delays caused by dynamic allocation of the machines, whereas our method implicitly adapts to such changes.
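To make the delayed-feedback setting concrete, the sketch below simulates projected stochastic gradient descent where each update is applied using a gradient computed at an iterate from several steps earlier, mimicking asynchronous workers writing stale gradients to shared memory. This is only an illustration of the problem setting, not the adaptive method proposed in the paper; all names and constants (`DELAY`, `ETA`, `project_l2_ball`, the least-squares objective) are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): projected SGD in
# which each step uses a stochastic gradient evaluated at a stale iterate,
# emulating delayed feedback from asynchronous workers on shared memory.
import numpy as np

rng = np.random.default_rng(0)
d, T, DELAY, ETA, RADIUS = 10, 2000, 5, 0.05, 1.0  # illustrative constants
A = rng.standard_normal((100, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(100)

def stochastic_grad(x):
    """Gradient of a single randomly sampled squared-loss term."""
    i = rng.integers(len(b))
    return (A[i] @ x - b[i]) * A[i]

def project_l2_ball(x, radius=RADIUS):
    """Euclidean projection onto the constraint set {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

x = np.zeros(d)
history = [x.copy()]  # keep past iterates so we can look up stale points
for t in range(T):
    stale_x = history[max(0, t - DELAY)]          # gradient computed DELAY steps late
    x = project_l2_ball(x - ETA * stochastic_grad(stale_x))
    history.append(x.copy())

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```

In this baseline, the step size `ETA` would typically have to be tuned against the (here fixed) delay, smoothness, and variance; the paper's contribution is guarantees that hold without such tuning, even when delays change over time.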
