Timely-Throughput Optimal Coded Computing over Cloud Networks

In modern distributed computing systems, unpredictable and unreliable infrastructure leads to high variability in computing resources. Meanwhile, demand is growing rapidly for timely, event-driven services with deadline constraints. Motivated by measurements over Amazon EC2 clusters, we consider a two-state Markov model for the variability of computing speed in cloud networks. In this model, each worker is in either a good state or a bad state in terms of computation speed, and the transitions between these states follow a Markov chain that is unknown to the scheduler. We then consider a Coded Computing framework, in which the data may be encoded and stored at the worker nodes to provide robustness against nodes that may be in a bad state. Given timely computation requests submitted to the system with computation deadlines, our goal is to design the optimal computation-load allocation scheme and the optimal data encoding scheme that maximize the timely computation throughput (i.e., the average number of computation tasks that are accomplished before their deadline). Our main result is the development of a dynamic computation strategy, called the Lagrange Estimate-and-Allocate (LEA) strategy, which achieves the optimal timely computation throughput. Compared to the static allocation strategy, LEA improves the timely computation throughput by 1.4x to 17.5x across various scenarios in simulations and by 1.27x to 6.5x in experiments over Amazon EC2 clusters.
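As a rough illustration of the two-state worker model described above, the following Python sketch simulates one worker whose state follows a two-state Markov chain and then estimates the transition probabilities by counting observed transitions. It is not the LEA strategy itself: the function names, the example probabilities (0.9 and 0.6), and the assumption that the worker's state is directly observable each round are all hypothetical choices made for illustration.

```python
import random


def simulate_states(p_gg, p_bb, rounds, seed=0):
    """Simulate a worker's state ('G' good, 'B' bad) over several rounds.

    p_gg and p_bb are the (hypothetical) probabilities of staying in the
    good and bad state, respectively, from one round to the next.
    """
    rng = random.Random(seed)
    state = "G"  # assumed initial state
    states = []
    for _ in range(rounds):
        states.append(state)
        stay = p_gg if state == "G" else p_bb
        if rng.random() >= stay:
            state = "B" if state == "G" else "G"
    return states


def estimate_transitions(states):
    """Estimate P(G->G) and P(B->B) from an observed state sequence."""
    counts = {"G": [0, 0], "B": [0, 0]}  # [times stayed, times visited]
    for prev, cur in zip(states, states[1:]):
        counts[prev][1] += 1
        if prev == cur:
            counts[prev][0] += 1

    def ratio(s):
        stayed, visited = counts[s]
        # Fall back to an uninformative 0.5 if the state was never visited.
        return stayed / visited if visited else 0.5

    return ratio("G"), ratio("B")


states = simulate_states(p_gg=0.9, p_bb=0.6, rounds=20000)
est_gg, est_bb = estimate_transitions(states)
print(f"estimated P(G->G) = {est_gg:.3f}, P(B->B) = {est_bb:.3f}")
```

A scheduler that does not know the chain could, under these assumptions, use such running estimates to predict which workers are likely to be fast in the next round before deciding how much computation load to assign to each.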
