Slack squeeze coded computing for adaptive straggler mitigation

While performing distributed computations on today's cloud-based platforms, execution speed variations among compute nodes can significantly degrade performance by creating straggler bottlenecks. Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. In this paper, we propose a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S2C2). S2C2 squeezes the compute slack (i.e., the redundancy overhead) built into coded computing frameworks by assigning work to all nodes, both fast and slow, in proportion to their speeds and without re-distributing data. We implement an LSTM-based algorithm to predict the speeds of compute nodes. We evaluate S2C2 on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms. We demonstrate a 19% to 39% reduction in total computation latency with S2C2 compared to job replication and conventional coded computation. We further show how S2C2 can be applied beyond matrix-vector multiplication.
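To make the workload-distribution idea concrete, the sketch below shows one way a scheduler could squeeze slack in an (n, k)-coded matrix-vector multiplication: each worker stores one coded row block, and rather than asking every worker for its full block and waiting for only the fastest k, the scheduler asks each worker for a fraction of its block proportional to its predicted speed. This is a minimal illustrative sketch, not the authors' exact partitioning or decoding procedure; the function name, the proportional assignment rule, and the example speeds are assumptions introduced here for illustration.

# Minimal sketch (not the authors' exact algorithm) of slack-squeezed
# work assignment for (n, k)-coded matrix-vector multiplication.
import numpy as np

def assign_work_fractions(predicted_speeds, k):
    """Return the fraction of its locally stored coded row block that
    each worker should process this round.

    Baseline coded computing asks every worker for 100% of its block
    and discards the work of the slowest n - k workers. Here the k
    blocks' worth of required work is instead spread over all n
    workers in proportion to their predicted speeds, so little
    computation is wasted as slack.
    """
    speeds = np.asarray(predicted_speeds, dtype=float)
    fractions = k * speeds / speeds.sum()   # proportional assignment
    return np.clip(fractions, 0.0, 1.0)     # no worker exceeds its own block

# Example: 5 workers holding (5, 3)-coded blocks, one predicted-slow node.
predicted_speeds = [1.0, 1.0, 0.9, 1.1, 0.4]  # illustrative values
print(assign_work_fractions(predicted_speeds, k=3))
# -> roughly [0.68, 0.68, 0.61, 0.75, 0.27]

A complete implementation would additionally redistribute any surplus created by the clipping step and check that every row index remains covered by enough workers for the result to stay decodable.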
