Transition Waste Optimization for Coded Elastic Computing

Distributed computing, in which a resource-intensive task is divided into subtasks and distributed among different machines, plays a key role in solving large-scale problems. Coded computing is a recently emerging paradigm in which redundancy is introduced into distributed computing to alleviate the impact of slow machines (stragglers) on the completion time. We investigate coded computing solutions over elastic resources, where the set of available machines may change in the middle of the computation. This setting is motivated by recently available services in the cloud computing industry (e.g., EC2 Spot, Azure Batch) in which low-priority virtual machines are offered at a fraction of the price of on-demand instances but can be preempted on short notice. Our contributions are three-fold. First, we introduce a new concept called transition waste, which quantifies the number of tasks that existing machines must abandon or take over when a machine joins or leaves. Second, we develop an efficient method to minimize the transition waste for the cyclic task allocation scheme recently proposed in the literature (Yang et al., ISIT'19). Finally, we establish a novel solution based on finite geometry that achieves zero transition waste, provided that the number of active machines varies within a fixed range.
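The transition-waste idea can be illustrated with a small sketch. The formalization below is a plausible reading of the abstract, not necessarily the paper's exact definition: given a task allocation before and after an elastic event, the waste is the total number of tasks that surviving machines must abandon or take over. The `cyclic_allocation` helper and the concrete load/step parameters are hypothetical illustrations of a cyclic scheme.

```python
def transition_waste(old, new):
    """Count, over machines present in both allocations (dicts mapping
    machine -> set of tasks), the tasks each machine abandons (held
    before, not after) plus the tasks it takes over (held after, not
    before)."""
    waste = 0
    for m in old.keys() & new.keys():          # surviving machines only
        waste += len(old[m] - new[m])          # tasks abandoned by m
        waste += len(new[m] - old[m])          # tasks taken over by m
    return waste


def cyclic_allocation(machines, n_tasks, load):
    """A simple cyclic scheme: the i-th listed machine is assigned `load`
    consecutive tasks starting at i * (n_tasks // len(machines)),
    wrapping around modulo n_tasks."""
    step = n_tasks // len(machines)
    return {
        m: {(i * step + j) % n_tasks for j in range(load)}
        for i, m in enumerate(machines)
    }


# Five machines, ten tasks, each task covered twice.
old = cyclic_allocation([0, 1, 2, 3, 4], 10, 4)

# Machine 4 is preempted; the remaining four machines each pick up one
# extra task: transition waste 4.
new = cyclic_allocation([0, 1, 2, 3], 10, 5)
print(transition_waste(old, new))  # -> 4
```

Under this toy model, which machine leaves matters: if an interior machine (say machine 2) leaves and the survivors are naively reindexed, the same helper yields a waste of 8 rather than 4, since the cyclic windows of the later machines shift. Avoiding this kind of avoidable reshuffling is exactly the inefficiency the paper's optimization targets.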

[1] Rong-Rong Chen, et al. A Practical Algorithm Design and Evaluation for Heterogeneous Elastic Computing with Stragglers, 2021 IEEE Global Communications Conference (GLOBECOM).

[2] Stark C. Draper, et al. Hierarchical Coded Elastic Computing, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Rong-Rong Chen, et al. Coded Elastic Computing on Machines With Heterogeneous Storage and Computation Speed, 2020, IEEE Transactions on Communications.

[4] Shivaram Venkataraman, et al. Learning-Based Coded Computation, 2020, IEEE Journal on Selected Areas in Information Theory.

[5] Rong-Rong Chen, et al. Heterogeneous Computation Assignments in Coded Elastic Computing, 2020 IEEE International Symposium on Information Theory (ISIT).

[6] Soummya Kar, et al. Coded Elastic Computing, 2019 IEEE International Symposium on Information Theory (ISIT).

[7] Amir Salman Avestimehr, et al. Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy, 2018, AISTATS.

[8] Shivaram Venkataraman, et al. Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation, 2018, arXiv.

[9] Malhar Chaudhari, et al. Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication, 2018, Proc. ACM Meas. Anal. Comput. Syst.

[10] Suhas N. Diggavi, et al. Straggler Mitigation in Distributed Optimization Through Data Encoding, 2017, NIPS.

[11] Farzin Haddadpour, et al. On the optimal recovery threshold of coded matrix multiplication, 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[12] Alexandros G. Dimakis, et al. Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017, ICML.

[13] Mohammad Ali Maddah-Ali, et al. Polynomial Codes: An Optimal Design for High-Dimensional Coded Matrix Multiplication, 2017, NIPS.

[14] Pulkit Grover, et al. "Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products, 2017, IEEE Transactions on Information Theory.

[15] A. Salman Avestimehr, et al. A Fundamental Tradeoff Between Computation and Communication in Distributed Computing, 2016, IEEE Transactions on Information Theory.

[16] Kannan Ramchandran, et al. Speeding Up Distributed Machine Learning Using Codes, 2015, IEEE Transactions on Information Theory.

[17] Mohammad Ali Maddah-Ali, et al. Coded MapReduce, 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[18] Luiz André Barroso, et al. The Tail at Scale, 2013, CACM.

[19] Son Hoang Dau, et al. Parity declustering for fault-tolerant storage systems via t-designs, 2014 IEEE International Conference on Big Data (Big Data).

[20] Scott Shenker, et al. Spark: Cluster Computing with Working Sets, 2010, HotCloud.

[21] Vito Napolitano, et al. Tactical (de-)compositions of symmetric configurations, 2009, Discrete Mathematics.

[22] Charles J. Colbourn, et al. Handbook of Combinatorial Designs, Second Edition (Discrete Mathematics and Its Applications), 2006.

[23] Garth A. Gibson, et al. Parity declustering for continuous operation in redundant disk arrays, 1992, ASPLOS V.

[24] John C. S. Lui, et al. Performance Analysis of Disk Arrays under Failure, 1990, VLDB.

[25] Jacob A. Abraham, et al. Algorithm-Based Fault Tolerance for Matrix Operations, 1984, IEEE Transactions on Computers.

[26] Randy H. Katz, et al. Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling, 2016, J. Mach. Learn. Res.

[27] Inge Li Gørtz, et al. COMP251: Network Flows, 2014.

[28] Hai Jin, et al. Parity Declustering for Continuous Operation in Redundant Disk Arrays, 2002.

[29] P. Hall. On Representatives of Subsets, 1935.

[30] Scott Shenker, et al. Effective Straggler Mitigation: Attack of the Clones, 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), 2013.