Coded Computation across Shared Heterogeneous Workers with Communication Delay

Distributed computing enables large-scale computation tasks to be processed over multiple workers in parallel. However, the randomness of communication and computation delays across workers causes the straggler effect, which may degrade performance. Coded computation helps mitigate the straggler effect, but the amount of redundant load and its assignment to the workers must be carefully optimized. In this work, we consider a multi-master, heterogeneous-worker distributed computing scenario in which multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation. The goal is to minimize the communication-plus-computation delay of the slowest task. We propose worker assignment, resource allocation, and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker processes the encoded tasks of either a single master or multiple masters, respectively. The resulting non-convex delay minimization problem is solved by employing a Markov's-inequality-based approximation, the Karush-Kuhn-Tucker conditions, and successive convex approximation. Through extensive simulations, we show that the proposed algorithms reduce task completion delay compared to the benchmarks, and observe that the dedicated and fractional worker assignment policies suit different application scenarios.
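To make the straggler-mitigation idea concrete, the following is a minimal NumPy sketch of MDS-coded matrix multiplication, the kind of encoding the abstract refers to. It is not the paper's specific scheme: the block sizes, the random Gaussian generator matrix (MDS with probability 1), and the choice of which workers straggle are all illustrative assumptions. With k data blocks encoded across n workers, the product is recoverable from any k fast workers, so up to n - k stragglers can be ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: compute y = A @ x with n workers while tolerating n - k stragglers.
m, d, k, n = 6, 4, 3, 5             # rows of A, cols of A, data blocks, workers
A = rng.standard_normal((m, d))
x = rng.standard_normal(d)

# Split A into k row blocks and encode them with a k x n generator matrix G.
# A random Gaussian G is MDS (any k columns invertible) with probability 1.
blocks = np.split(A, k)             # each block is (m/k) x d
G = rng.standard_normal((k, n))
encoded = [sum(G[i, j] * blocks[i] for i in range(k)) for j in range(n)]

# Each worker j computes the partial product of its encoded block.
partials = [Aj @ x for Aj in encoded]

# Suppose workers 1 and 3 straggle: decode from any k = 3 fast workers.
fast = [0, 2, 4]
Gsub = G[:, fast]                   # k x k submatrix, invertible for MDS codes
Y = np.stack([partials[j] for j in fast], axis=1)   # (m/k) x k received results
decoded = Y @ np.linalg.inv(Gsub)   # recover the k uncoded block products
y_hat = np.concatenate([decoded[:, i] for i in range(k)])

assert np.allclose(y_hat, A @ x)    # slow workers never needed
```

The redundancy level n - k and the per-worker load are exactly the quantities the abstract says must be optimized jointly against each worker's (random) communication and computation delay.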
