Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication

Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplications that must be parallelized across multiple nodes. The presence of straggling nodes -- computing nodes that unpredictably slow down or fail -- is a major bottleneck in such distributed computations. Ideal load balancing strategies that dynamically allocate more tasks to faster nodes require knowledge or monitoring of node speeds as well as the ability to quickly move data. Recently proposed fixed-rate erasure coding strategies can handle unpredictable node slowdown, but they discard the partial work done by straggling nodes and thus perform a large amount of redundant computation. We propose a rateless fountain coding strategy that achieves the best of both worlds -- we prove that its latency is asymptotically equal to that of ideal load balancing, and that it performs asymptotically zero redundant computation. Our idea is to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than m row-vector products are collectively completed by the nodes. Experiments in three computing environments (local parallel computing, Amazon EC2, and Amazon Lambda) show that rateless coding gives up to a 3x speed-up over uncoded schemes.
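To make the encode/compute/decode pipeline concrete, here is a minimal NumPy sketch of the idea. Each coded row is a random linear combination of the m rows of A; once slightly more than m coded row-vector products are collected, the original product Ax is recovered by solving a linear system. This is only an illustration: a true rateless (LT/fountain) code would use sparse binary combinations drawn from a Soliton-type degree distribution and a peeling decoder, whereas this sketch substitutes dense Gaussian combinations and a least-squares solve. The sizes m, n, and the overhead are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 50
A = rng.standard_normal((m, n))   # the matrix whose product with x we want
x = rng.standard_normal(n)

# Encode: each coded row is a random linear combination of the rows of A.
# (A real LT code would use sparse binary combinations; Gaussian coefficients
# are a stand-in that keeps this sketch short.)
overhead = 10                      # "slightly more than m" coded results
k = m + overhead
G = rng.standard_normal((k, m))    # generator matrix of the code
coded_rows = G @ A                 # these k rows would be split among workers

# Workers each compute coded row-vector products; collectively they return
# b_i = (coded row i) . x = (G @ (A @ x))_i.
b = coded_rows @ x

# Decode: once enough products arrive, solve G z = b for z = A @ x.
z, *_ = np.linalg.lstsq(G, b, rcond=None)
assert np.allclose(z, A @ x)
```

Note that the decoder never needs to know which workers were slow: any k = m + overhead coded products suffice, which is what makes the scheme rateless and lets stragglers' partial work count toward the total.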
