Tree Gradient Coding

Scaling up distributed machine learning systems faces two major bottlenecks: delays due to stragglers and limited communication bandwidth. Recently, a number of coding-theoretic strategies have been proposed to mitigate these bottlenecks. In particular, the Gradient Coding (GC) scheme speeds up distributed gradient descent in a synchronous master-worker setting by providing robustness to stragglers. A major drawback of the master-worker architecture for distributed learning, however, is bandwidth contention at the master, which can significantly degrade performance as the cluster size grows. In this paper, we propose a new framework named Tree Gradient Coding (TGC) for distributed gradient aggregation, which parallelizes communication over a tree topology while providing straggler robustness. As our main contribution, we characterize the minimum computation load of TGC for a given tree topology and straggler resiliency, and design a tree gradient coding algorithm that achieves this optimal computation load. Furthermore, we report experiments on Amazon EC2 in which TGC speeds up training by up to 18.8× compared to GC.
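
To make the straggler-robust aggregation idea concrete, the following is a minimal sketch of the gradient-coding primitive that TGC builds on: the classic 3-worker, 1-straggler construction, in which each worker computes partial gradients on 2 of 3 data partitions and sends a single coded vector, and the full gradient sum is recoverable from any 2 of the 3 workers. This is only an illustration of the underlying GC idea; the encoding matrix, data, and dimensions below are hypothetical, and TGC's own tree-structured encoding and aggregation are not shown here.

```python
# Illustrative sketch of the gradient-coding primitive (not the paper's TGC scheme).
# Hypothetical setup: 3 workers, 3 data partitions, tolerance to 1 straggler.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                              # model dimension (hypothetical)
g = [rng.standard_normal(d) for _ in range(3)]     # partial gradients g1, g2, g3

# Encoding matrix B: row i is worker i's linear combination of the partial gradients.
B = np.array([[0.5, 1.0,  0.0],                    # worker 1 sends (1/2)g1 + g2
              [0.0, 1.0, -1.0],                    # worker 2 sends g2 - g3
              [0.5, 0.0,  1.0]])                   # worker 3 sends (1/2)g1 + g3
coded = B @ np.stack(g)                            # each row is one worker's message

# Decoding vectors a_S satisfying a_S @ B[S] = [1, 1, 1] for every 2-worker subset S,
# so the sum g1 + g2 + g3 is recovered no matter which single worker straggles.
decoders = {(0, 1): np.array([2.0, -1.0]),
            (0, 2): np.array([1.0,  1.0]),
            (1, 2): np.array([1.0,  2.0])}

full = sum(g)                                      # ground-truth full gradient
for survivors, a in decoders.items():
    recovered = a @ coded[list(survivors)]
    assert np.allclose(recovered, full)
print("full gradient recovered from every 2-of-3 worker subset")
```

In TGC, as described in the abstract, this kind of coded aggregation is carried out over a tree topology rather than at a single master, so that communication is parallelized across internal nodes; the optimal per-worker computation load for a given tree and straggler tolerance is the quantity the paper characterizes.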
