Heterogeneous Computation across Heterogeneous Workers

The coded distributed computing framework enables large-scale machine learning (ML) models to be trained efficiently in a distributed manner while mitigating the straggler effect. In this work, we consider a multi-task assignment problem in a coded distributed computing system, where multiple masters, each with a different matrix multiplication task, assign computation tasks to workers with heterogeneous computing capabilities. Both dedicated and probabilistic worker assignment models are considered, with the objective of minimizing the average completion time of all computations. For dedicated worker assignment, greedy algorithms are proposed, and the corresponding optimal load allocation is derived using the Lagrange multiplier method. For probabilistic assignment, the successive convex approximation method is used to solve the resulting non-convex optimization problem. Simulation results show that the proposed algorithms reduce the completion time by 150% over an uncoded scheme and by 30% over an unbalanced coded scheme.
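To make the straggler effect concrete, the following is a minimal sketch for a single master, not the algorithms proposed in this work. It assumes an MDS-style code in which any `rows_needed` coded rows suffice to recover the product, and a shifted-exponential delay model in which a worker returns all of its rows at once. The function names, worker parameters, and redundancy factor are hypothetical and chosen only for illustration.

```python
import random


def completion_time(loads, finish_times, rows_needed):
    """Earliest time at which the finished workers have returned
    at least `rows_needed` rows (inf if they never do)."""
    received = 0
    for t, load in sorted(zip(finish_times, loads)):
        received += load
        if received >= rows_needed:
            return t
    return float("inf")


def sample_finish_times(loads, shifts, rates, rng):
    """Assumed delay model: worker i needs load * (shift_i + Exp(rate_i))
    time units and delivers its whole load at that instant."""
    return [l * (a + rng.expovariate(mu))
            for l, a, mu in zip(loads, shifts, rates)]


def simulate(rows=1200, redundancy=1.5, trials=2000, seed=0):
    """Average completion time of an uncoded and a coded scheme
    over heterogeneous workers (illustrative parameters only)."""
    rng = random.Random(seed)
    # Hypothetical heterogeneous workers: two fast, two medium, two slow.
    shifts = [0.001, 0.001, 0.002, 0.002, 0.004, 0.004]
    rates = [1000.0, 1000.0, 500.0, 500.0, 250.0, 250.0]
    n = len(shifts)

    # Uncoded: rows are split evenly and every worker must finish.
    uncoded_loads = [rows // n] * n
    # Coded: each worker gets more (coded) rows, but any `rows` of them suffice.
    coded_loads = [int(redundancy * rows) // n] * n

    uncoded_total = 0.0
    coded_total = 0.0
    for _ in range(trials):
        t_unc = sample_finish_times(uncoded_loads, shifts, rates, rng)
        uncoded_total += completion_time(uncoded_loads, t_unc, rows)
        t_cod = sample_finish_times(coded_loads, shifts, rates, rng)
        coded_total += completion_time(coded_loads, t_cod, rows)
    return uncoded_total / trials, coded_total / trials


if __name__ == "__main__":
    avg_uncoded, avg_coded = simulate()
    print(f"uncoded: {avg_uncoded:.3f}  coded: {avg_coded:.3f}")
```

The coded loads here are equal across workers, so this sketch is closer to an unbalanced coded baseline. Matching each worker's load to its speed, as the load allocation in this work aims to do, is the natural next refinement.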
