A Practical Algorithm Design and Evaluation for Heterogeneous Elastic Computing with Stragglers

Our extensive measurements on Amazon EC2 show that virtual instances often have different computing speeds even when they share the same configuration. This motivates us to study heterogeneous Coded Storage Elastic Computing (CSEC) systems, in which machines with different computing speeds join and leave the network arbitrarily across computing steps. In CSEC systems, a Maximum Distance Separable (MDS) code is used for coded storage so that the file placement does not have to be redefined at each elastic event. Computation assignment algorithms then minimize the computation time given the computing speeds of the machines. Whereas previous studies of heterogeneous CSEC do not account for stragglers, i.e., machines that are slow during the computation, we develop a new framework for heterogeneous CSEC that introduces straggler tolerance. Based on this framework, we design a novel algorithm, building on our previously proposed approach for heterogeneous CSEC, such that the system can tolerate any subset of stragglers of a specified size while minimizing the computation time. Furthermore, we establish a trade-off between computation time and straggler tolerance. Another major limitation of existing CSEC designs is the lack of practical evaluations on real applications. In this paper, we evaluate the performance of our designs on Amazon EC2 for power iteration and linear regression applications. The results show that the proposed heterogeneous CSEC algorithms outperform state-of-the-art designs by more than 30%.
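To make the MDS coded-storage idea concrete, the sketch below shows a minimal coded matrix-vector multiplication in which any k of n coded storage blocks suffice to recover the result, so up to n - k stragglers can be ignored without re-placing the data. This is a hypothetical illustration of the general coded-computing principle, not the paper's specific assignment algorithm; the (n, k) values, the real-valued Vandermonde generator, and the chosen surviving machines are all assumptions for the example.

```python
import numpy as np

k, n = 3, 5  # any k of the n machines suffice; tolerates n - k = 2 stragglers
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # data matrix; its rows are split into k blocks
x = rng.standard_normal(4)       # vector to multiply in each computing step

# Encode once: mix the k row-blocks with an n x k Vandermonde generator,
# which is MDS over the reals (any k of its rows form an invertible matrix),
# so the placement survives machines joining or leaving.
blocks = np.stack(np.split(A, k))                                      # (k, 2, 4)
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)   # (n, k)
coded = np.tensordot(G, blocks, axes=1)           # n coded blocks, one per machine

# Each machine computes its local coded product; suppose only
# machines {0, 2, 4} finish in time (the others are stragglers).
survivors = [0, 2, 4]
partial = np.stack([coded[i] @ x for i in survivors])

# Decode: invert the k surviving generator rows to recover all block products.
decoded = np.tensordot(np.linalg.inv(G[survivors]), partial, axes=1)
result = decoded.reshape(-1)
assert np.allclose(result, A @ x)
```

The same decoding works for any other size-k subset of machines, which is the property that lets an elastic event or a straggler be handled without redefining the file placement.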
