Coded Elastic Computing

Cloud providers have recently introduced new offerings whereby spare computing resources are accessible at discounts compared to on-demand computing. Exploiting this opportunity is challenging because such resources are accessed at low priority and can therefore elastically leave the computation (through preemption) and join it at any time. In this paper, we design a new technique called coded elastic computing that enables distributed computations over elastic resources. The proposed technique allows machines to leave the computation without sacrificing algorithm-level performance and, at the same time, flexibly reduces the workload at existing machines when new ones join. Leveraging coded redundancy, our approach achieves a computational cost similar to that of the original (uncoded) method when all machines are present; the cost increases gracefully when machines are preempted and decreases when machines join. The performance of the proposed technique is evaluated on matrix-vector multiplication and linear regression tasks and shows improvements over existing techniques.
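
To make the role of coded redundancy concrete, the sketch below illustrates MDS-style coded matrix-vector multiplication, the basic building block that enables elasticity: the data matrix is split into K row-blocks, encoded into N coded blocks, and the product can be recovered from the results of any K surviving machines. This is a minimal illustration under stated assumptions; the block sizes, the Vandermonde generator, and the decoder are illustrative choices, not the paper's exact construction.

```python
import numpy as np

# Hedged sketch: (N, K) MDS-style coding over row-blocks of A for elastic
# matrix-vector multiplication. Names, block sizes, and the Vandermonde
# generator are illustrative assumptions, not the paper's exact construction.

def encode_blocks(A, N, K, seed=0):
    """Split A into K row-blocks and form N coded blocks; any K recover A @ x."""
    m, d = A.shape
    assert m % K == 0, "for simplicity, require K to divide the number of rows"
    blocks = A.reshape(K, m // K, d)                 # A_1, ..., A_K
    rng = np.random.default_rng(seed)
    points = rng.uniform(1.0, 2.0, size=N)           # distinct evaluation points
    G = np.vander(points, K, increasing=True)        # N x K generator; any K rows invertible
    coded = np.einsum('nk,krd->nrd', G, blocks)      # coded block stored at machine n
    return coded, G

def worker_compute(coded_block, x):
    """Work done by one surviving (non-preempted) machine."""
    return coded_block @ x

def decode(results, alive, G, K):
    """Recover A @ x from the results of any K surviving machines."""
    ids = alive[:K]
    Y = np.stack([results[i] for i in ids])          # K partial results
    blocks = np.linalg.solve(G[ids, :], Y)           # solves for A_k @ x, k = 1..K
    return blocks.reshape(-1)

if __name__ == "__main__":
    N, K = 5, 3                                      # tolerates up to N - K preemptions
    A, x = np.random.randn(6, 4), np.random.randn(4)
    coded, G = encode_blocks(A, N, K)
    alive = [0, 2, 4]                                # machines 1 and 3 were preempted
    results = {i: worker_compute(coded[i], x) for i in alive}
    assert np.allclose(decode(results, alive, G, K), A @ x)
```

When more than K machines remain available, the same encoding lets each machine process only a fraction of its coded rows, which is how the workload at existing machines can shrink when new machines join.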
