ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding. Gradient-coded distributed GD uses redundancy to exactly recover the gradient at each iteration from a subset of compute nodes. ErasureHead instead uses approximate gradient codes to recover an inexact gradient at each iteration, but with higher delay tolerance. Unlike prior work on gradient coding, we provide a performance analysis that combines both delay and convergence guarantees. We establish that, down to a small noise floor, ErasureHead converges as quickly as distributed GD and has a faster overall runtime under a probabilistic delay model. We conduct extensive experiments on real-world datasets and distributed clusters and demonstrate that our method can lead to significant speedups over both standard and gradient-coded GD.
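As a rough illustration of the idea (a sketch, not the paper's implementation), the code below shows one common form of approximate gradient coding under a cyclic, fraction-repetition-style assignment: each data partition is replicated on a few workers, each worker returns the sum of gradients over its partitions, and the master simply rescales whatever arrives before its deadline, treating stragglers as erasures. The names `cyclic_assignment`, `worker_message`, `decode_approximate`, and the parameter `replication` are hypothetical and introduced here only for illustration.

```python
# Minimal sketch (assumed, not the authors' code) of approximate gradient
# coding. With no stragglers the decoded result equals the full-data
# gradient; with stragglers it is the inexact gradient described above.
import numpy as np

def cyclic_assignment(n_workers, replication):
    # Worker w holds partitions w, w+1, ..., w+replication-1 (mod n_workers),
    # so each partition is stored on exactly `replication` workers.
    return [[(w + s) % n_workers for s in range(replication)]
            for w in range(n_workers)]

def worker_message(grad_fn, X_parts, y_parts, theta, my_partitions):
    # Each worker sends one message: the sum of its per-partition gradients.
    g = np.zeros_like(theta)
    for p in my_partitions:
        g += grad_fn(X_parts[p], y_parts[p], theta)
    return g

def decode_approximate(received, replication, n_partitions):
    # The "decoder" just sums the messages that arrived and rescales.
    # If every worker responds, each partition is counted exactly
    # `replication` times and the result is the exact average gradient;
    # otherwise it is an approximation whose error shrinks as the
    # replication factor grows.
    if not received:
        return None  # every worker straggled in this round
    return sum(received) / (replication * n_partitions)
```

The design choice this sketch highlights is the trade-off in the abstract: instead of waiting long enough (or decoding carefully enough) to recover the gradient exactly, the master stops at a deadline and accepts a small gradient error in exchange for much better delay tolerance.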
