Coded Computation Over Heterogeneous Clusters

In large-scale distributed computing clusters, such as Amazon EC2, several types of “system noise” can cause major performance degradation: system failures, bottlenecks due to limited communication bandwidth, latency due to straggler nodes, and so on. Recent results have demonstrated that coding can make efficient use of computation and storage redundancy to alleviate the effect of stragglers and communication bottlenecks in homogeneous clusters. In this paper, we focus on general heterogeneous distributed computing clusters consisting of a variety of computing machines with different capabilities. We propose a coding framework for speeding up distributed computing in heterogeneous clusters by trading redundancy for reduced computation latency. In particular, we propose the heterogeneous coded matrix multiplication (HCMM) algorithm for performing distributed matrix multiplication over heterogeneous clusters, which is provably asymptotically optimal for a broad class of processing time distributions. Moreover, we show that HCMM is unboundedly faster than any uncoded scheme that partitions the total workload among the workers. To demonstrate how the proposed HCMM scheme can be applied in practice, we provide results from numerical studies and Amazon EC2 experiments comparing HCMM with three benchmark load allocation schemes: uniform uncoded, load-balanced uncoded, and uniform coded. In our numerical studies, HCMM achieves speedups of up to 73%, 56%, and 42%, respectively, over these three benchmark schemes. Furthermore, we carry out experiments over Amazon EC2 clusters and demonstrate how HCMM can be combined with rateless codes with nearly linear decoding complexity. In particular, we show that HCMM combined with Luby transform (LT) codes can significantly reduce the overall execution time. In these experiments, HCMM is found to be up to 61%, 46%, and 36% faster than the three benchmark schemes, respectively. Additionally, we generalize the problem of optimal load allocation in heterogeneous settings to take into account the monetary costs associated with distributed computing clusters. We argue that HCMM is asymptotically optimal for budget-constrained scenarios as well. In particular, we characterize the minimum possible expected cost associated with a computation task over a given cluster of machines. Furthermore, we develop a heuristic algorithm for HCMM load allocation for the distributed implementation of budget-limited computation tasks.
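
The general idea of coded matrix-vector multiplication with unequal loads can be illustrated with a minimal sketch. It is not the HCMM load allocation or the LT-code construction from the paper: as an assumption for illustration, coded rows are built with a Vandermonde code over the rationals (so that any r coded results suffice to recover the r entries of Ax), and loads are simply set proportional to hypothetical worker speeds. All function names (`allocate_loads`, `encode`, `solve`, `coded_matvec`) and the `speeds`, `redundancy`, and `stragglers` parameters are hypothetical.

```python
from fractions import Fraction


def allocate_loads(speeds, total_rows):
    """Split total_rows coded rows across workers in proportion to speed.

    Assumption for illustration only; this is not the HCMM allocation rule.
    """
    total_speed = sum(speeds)
    shares = [Fraction(s * total_rows, total_speed) for s in speeds]
    loads = [int(q) for q in shares]  # floor of each share
    leftovers = total_rows - sum(loads)
    # Give the remaining rows to the workers with the largest fractional parts.
    order = sorted(range(len(speeds)), key=lambda i: shares[i] - loads[i], reverse=True)
    for i in order[:leftovers]:
        loads[i] += 1
    return loads


def encode(A, n_coded):
    """Coded row j is sum_k (j+1)^k * A[k] (Vandermonde coefficients)."""
    r, m = len(A), len(A[0])
    coeffs = [[Fraction(j + 1) ** k for k in range(r)] for j in range(n_coded)]
    coded = [[sum(c[k] * A[k][col] for k in range(r)) for col in range(m)] for c in coeffs]
    return coeffs, coded


def solve(M, b):
    """Gauss-Jordan elimination over the rationals for a square invertible system."""
    n = len(M)
    aug = [list(row) + [val] for row, val in zip(M, b)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                factor = aug[i][col]
                aug[i] = [vi - factor * vc for vi, vc in zip(aug[i], aug[col])]
    return [row[n] for row in aug]


def coded_matvec(A, x, speeds, redundancy, stragglers):
    """Recover A x from the workers that respond; stragglers return nothing."""
    r = len(A)
    n_coded = int(redundancy * r)
    loads = allocate_loads(speeds, n_coded)
    coeffs, coded = encode(A, n_coded)
    results = []  # (coefficient row, coded inner product) pairs received by the master
    start = 0
    for worker, load in enumerate(loads):
        rows = range(start, start + load)
        start += load
        if worker in stragglers:
            continue
        for j in rows:
            results.append((coeffs[j], sum(a * b for a, b in zip(coded[j], x))))
    if len(results) < r:
        raise RuntimeError("not enough coded results to decode")
    # Any r distinct Vandermonde rows are linearly independent, so decode from the first r.
    M = [c for c, _ in results[:r]]
    z = [v for _, v in results[:r]]
    return solve(M, z), loads


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
y, loads = coded_matvec(A, [1, 1], speeds=[1, 2, 3], redundancy=Fraction(3, 2), stragglers={1})
print(loads, [int(v) for v in y])
```

In this toy run the middle worker is treated as a straggler, and the master still decodes Ax from the four coded results returned by the other two workers. An uncoded partition of the same four rows would have to wait for every worker, which is the latency-for-redundancy trade the abstract describes.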
