Serverless Straggler Mitigation using Local Error-Correcting Codes

Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase the end-to-end latency of distributed computation. We propose and implement simple yet principled approaches to straggler mitigation for matrix multiplication in serverless systems, and evaluate them on several common applications from machine learning and high-performance computing. The proposed schemes are inspired by error-correcting codes and use serverless workers to encode and decode, in parallel, the data stored in the cloud. The result is a fully distributed computing framework in which no master node performs encoding or decoding, which removes the computation, communication, and storage bottlenecks at the master. On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability. Through extensive experiments, we show that our scheme outperforms existing schemes, such as speculative execution and other coding-theoretic methods, by at least 25%.
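The abstract does not spell out the code construction, so the following is only a minimal sketch of the general idea of a local code applied to matrix multiplication, not the authors' scheme. It assumes the row blocks of A are split into groups of a fixed size, that each group gets a single parity block equal to the sum of its members, and that every block is multiplied by B on a separate worker. Under those assumptions, one straggler per group can be recovered using only the outputs from that same group. The helper names (`encode_local`, `decode_local`) and the group layout are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a: Matrix, b: Matrix) -> Matrix:
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encode_local(blocks: List[Matrix], group_size: int) -> List[Matrix]:
    """Append one parity block (the sum of the group) after each group.

    Assumes len(blocks) is a multiple of group_size (illustrative restriction).
    """
    coded: List[Matrix] = []
    for start in range(0, len(blocks), group_size):
        group = blocks[start:start + group_size]
        parity = group[0]
        for blk in group[1:]:
            parity = add(parity, blk)
        coded.extend(group)
        coded.append(parity)
    return coded


def decode_local(results: List[Optional[Matrix]], group_size: int) -> List[Matrix]:
    """Recover the uncoded outputs; None marks a straggling worker.

    Each group of group_size + 1 results is decoded independently, so the
    groups could be handled by separate workers in parallel.
    """
    step = group_size + 1
    decoded: List[Matrix] = []
    for start in range(0, len(results), step):
        group = results[start:start + step]
        missing = [i for i, r in enumerate(group) if r is None]
        if len(missing) > 1:
            raise ValueError("more than one straggler in a local group")
        if missing and missing[0] != group_size:
            # Missing data output = parity output minus the other data outputs.
            recovered = group[group_size]
            for i in range(group_size):
                if i != missing[0]:
                    recovered = sub(recovered, group[i])
            group[missing[0]] = recovered
        decoded.extend(group[:group_size])
    return decoded


# Toy example: four 1x2 row blocks of A, groups of two, and a 2x2 matrix B.
a_blocks: List[Matrix] = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
b: Matrix = [[1, 0], [2, 1]]

coded_blocks = encode_local(a_blocks, group_size=2)
worker_outputs: List[Optional[Matrix]] = [matmul(blk, b) for blk in coded_blocks]

# Simulate one straggler in each local group (indices 1 and 3).
worker_outputs[1] = None
worker_outputs[3] = None

decoded = decode_local(worker_outputs, group_size=2)
expected = [matmul(blk, b) for blk in a_blocks]
```

Because the multiplication by B is linear, the parity worker's output equals the sum of the data workers' outputs in its group, which is what makes the subtraction in `decode_local` valid. The trade-off in this sketch is redundancy versus tolerance: smaller groups cost more parity workers but tolerate more stragglers overall.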
