Tree Gradient Coding

Scaling up distributed machine learning systems faces two major bottlenecks: delays due to stragglers and limited communication bandwidth. Recently, a number of coding-theoretic strategies have been proposed to mitigate these bottlenecks. In particular, the Gradient Coding (GC) scheme speeds up distributed gradient descent in a synchronous master-worker setting by providing robustness to stragglers. A major drawback of the master-worker architecture for distributed learning, however, is bandwidth contention at the master, which can significantly degrade performance as the cluster size grows. In this paper, we propose a new framework named Tree Gradient Coding (TGC) for distributed gradient aggregation, which parallelizes communication over a tree topology while providing straggler robustness. As our main contribution, we characterize the minimum computation load of TGC for a given tree topology and straggler resiliency, and design a tree gradient coding algorithm that achieves this optimal computation load. Furthermore, we report experiments on Amazon EC2 in which TGC speeds up training by up to 18.8× compared to GC.
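
To make the straggler-robust aggregation idea concrete, the following is a minimal sketch of the gradient-coding primitive that TGC builds on: the classic 3-worker, 1-straggler construction, in which each worker computes partial gradients on 2 of 3 data partitions and sends a single coded vector, and the full gradient sum is recoverable from any 2 of the 3 workers. This is only an illustration of the underlying GC idea; the encoding matrix, data, and dimensions below are hypothetical, and TGC's own tree-structured encoding and aggregation are not shown here.

```python
# Illustrative sketch of the gradient-coding primitive (not the paper's TGC scheme).
# Hypothetical setup: 3 workers, 3 data partitions, tolerance to 1 straggler.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                              # model dimension (hypothetical)
g = [rng.standard_normal(d) for _ in range(3)]     # partial gradients g1, g2, g3

# Encoding matrix B: row i is worker i's linear combination of the partial gradients.
B = np.array([[0.5, 1.0,  0.0],                    # worker 1 sends (1/2)g1 + g2
              [0.0, 1.0, -1.0],                    # worker 2 sends g2 - g3
              [0.5, 0.0,  1.0]])                   # worker 3 sends (1/2)g1 + g3
coded = B @ np.stack(g)                            # each row is one worker's message

# Decoding vectors a_S satisfying a_S @ B[S] = [1, 1, 1] for every 2-worker subset S,
# so the sum g1 + g2 + g3 is recovered no matter which single worker straggles.
decoders = {(0, 1): np.array([2.0, -1.0]),
            (0, 2): np.array([1.0,  1.0]),
            (1, 2): np.array([1.0,  2.0])}

full = sum(g)                                      # ground-truth full gradient
for survivors, a in decoders.items():
    recovered = a @ coded[list(survivors)]
    assert np.allclose(recovered, full)
print("full gradient recovered from every 2-of-3 worker subset")
```

In TGC, as described in the abstract, this kind of coded aggregation is carried out over a tree topology rather than at a single master, so that communication is parallelized across internal nodes; the optimal per-worker computation load for a given tree and straggler tolerance is the quantity the paper characterizes.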
