Optimizing the Transition Waste in Coded Elastic Computing

Motivated by recently available services in the cloud computing industry, e.g., EC2 Spot or Azure Batch, where spare/low-priority virtual machines are offered at a fraction of the price of the on-demand instances but can be preempted on short notice, we investigate coded computing solutions over elastic resources, where the set of available machines may change in the middle of the computation. Our contributions are two-fold: We first propose an efficient method to minimize the transition waste, a newly introduced concept quantifying the total number of tasks that existing machines have to abandon or take on anew when a machine joins or leaves, for the cyclic elastic task allocation scheme recently proposed in the literature (Yang et al. ISIT’19). We then proceed to generalize such a scheme and introduce new task allocation schemes based on finite geometry that achieve zero transition wastes as long as the number of active machines varies within a fixed range. The proposed solutions can be applied on top of existing coded computing schemes tolerating stragglers.

[1]  Pulkit Grover,et al.  “Short-Dot”: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products , 2017, IEEE Transactions on Information Theory.

[2]  Scott Shenker,et al.  Usenix Association 10th Usenix Symposium on Networked Systems Design and Implementation (nsdi '13) 185 Effective Straggler Mitigation: Attack of the Clones , 2022 .

[3]  Son Hoang Dau,et al.  Parity declustering for fault-tolerant storage systems via t-designs , 2012, 2014 IEEE International Conference on Big Data (Big Data).

[4]  Farzin Haddadpour,et al.  On the optimal recovery threshold of coded matrix multiplication , 2017, 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[5]  Alexandros G. Dimakis,et al.  Gradient Coding: Avoiding Stragglers in Distributed Learning , 2017, ICML.

[6]  Kannan Ramchandran,et al.  Speeding Up Distributed Machine Learning Using Codes , 2015, IEEE Transactions on Information Theory.

[7]  Ravindra K. Ahuja,et al.  Network Flows , 2011 .

[8]  Mohammad Ali Maddah-Ali,et al.  Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication , 2017, NIPS.

[9]  Suhas N. Diggavi,et al.  Straggler Mitigation in Distributed Optimization Through Data Encoding , 2017, NIPS.

[10]  Shivaram Venkataraman,et al.  Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation , 2018, ArXiv.

[11]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[12]  P. Hall On Representatives of Subsets , 1935 .

[13]  Amir Salman Avestimehr,et al.  Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy , 2018, AISTATS.

[14]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[15]  Vito Napolitano,et al.  Tactical (de-)compositions of symmetric configurations , 2009, Discret. Math..

[16]  Randy H. Katz,et al.  Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling , 2016, J. Mach. Learn. Res..

[17]  A. Salman Avestimehr,et al.  A Fundamental Tradeoff Between Computation and Communication in Distributed Computing , 2016, IEEE Transactions on Information Theory.

[18]  Mohammad Ali Maddah-Ali,et al.  Coded MapReduce , 2015, 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[19]  Luiz André Barroso,et al.  The tail at scale , 2013, CACM.

[20]  Soummya Kar,et al.  Coded Elastic Computing , 2018, 2019 IEEE International Symposium on Information Theory (ISIT).

[21]  John C. S. Lui,et al.  Performance Analysis of Disk Arrays under Failure , 1990, VLDB.

[22]  Garth A. Gibson,et al.  Parity declustering for continuous operation in redundant disk arrays , 1992, ASPLOS V.