Coded computation for multicore setups

Consider a distributed computing setup consisting of a master node and n worker nodes, each equipped with p cores, and a function f(x) = g(f1(x), f2(x), …, fk(x)), where each fi can be computed independently of the rest. Assuming that the worker computation times have exponential tails, what is the minimum possible time for computing f? Can we use coding-theoretic principles to speed up this distributed computation? In [1], it is shown that distributed computation of linear functions can be expedited by applying linear erasure codes. However, it is not clear whether linear codes can also speed up distributed computation of ‘nonlinear’ functions. To resolve this problem, we propose the use of sparse linear codes, exploiting the modern multicore processing architecture. We show that 1) our coding solution achieves the order-optimal runtime, and 2) it is at least Θ(√log n) times faster than any uncoded scheme, where n is the number of workers.
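To make the coded-computation idea from [1] concrete, here is a minimal, hypothetical sketch (not the paper's sparse-code construction): distributed matrix-vector multiplication protected by a (3, 2) MDS-style code. The matrix A is split into two row blocks; a third "parity" worker computes (A1 + A2)x, so the master can recover Ax from any two of the three worker results, masking one straggler. All names (`matvec`, `recover`, the worker labels) are illustrative.

```python
# Illustrative (3, 2)-coded matrix-vector multiplication: any 2 of 3
# worker results suffice to reconstruct A x, tolerating one straggler.
# This is a toy sketch, not the paper's sparse-code construction.

def matvec(M, x):
    # Plain matrix-vector product over Python lists.
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

A1 = [[1, 2], [3, 4]]   # top row block of A
A2 = [[5, 6], [7, 8]]   # bottom row block of A
x = [1, 1]

# Worker tasks: two systematic blocks plus one coded (parity) block.
tasks = {
    "w1": lambda: matvec(A1, x),
    "w2": lambda: matvec(A2, x),
    "w3": lambda: matvec([add(r1, r2) for r1, r2 in zip(A1, A2)], x),
}

def recover(results):
    """Reconstruct A x = [A1 x; A2 x] from any two worker results."""
    if "w1" in results and "w2" in results:
        return results["w1"] + results["w2"]
    if "w1" in results and "w3" in results:
        # A2 x = (A1 + A2) x - A1 x
        return results["w1"] + sub(results["w3"], results["w1"])
    # A1 x = (A1 + A2) x - A2 x
    return sub(results["w3"], results["w2"]) + results["w2"]

# Suppose worker w2 straggles; the master still recovers A x exactly.
done = {"w1": tasks["w1"](), "w3": tasks["w3"]()}
assert recover(done) == matvec(A1, x) + matvec(A2, x)
```

An uncoded scheme must wait for both of its two workers, so its runtime is the maximum of two exponential-tailed delays; the coded scheme waits only for the fastest two of three, which is where the speedup comes from.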

[1] D. S. Papailiopoulos et al., “Speeding up distributed machine learning using codes,” ISIT, 2016.

[2] M. Rudelson et al., “Invertibility of sparse non-Hermitian matrices,” arXiv:1507.03525, 2015.

[3] A. G. Dimakis et al., “Gradient coding,” arXiv, 2016.

[4] R. H. Katz et al., “Improving MapReduce performance in heterogeneous environments,” OSDI, 2008.

[5] M. A. Maddah-Ali et al., “Coded MapReduce,” 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2015.

[6] M. Harchol-Balter et al., “Reducing latency via redundant requests: exact analysis,” SIGMETRICS, 2015.

[7] E. Saule et al., “Replicated data placement for uncertain scheduling,” IEEE International Parallel and Distributed Processing Symposium Workshop, 2015.

[8] A. S. Avestimehr et al., “Coded computation over heterogeneous clusters,” IEEE Transactions on Information Theory, 2019.

[9] R. L. Urbanke et al., Modern Coding Theory, 2008.

[10] L. A. Barroso et al., “The tail at scale,” Communications of the ACM, 2013.

[11] S. Bengio et al., “Revisiting distributed synchronous SGD,” arXiv, 2016.

[12] S. Shenker et al., “Effective straggler mitigation: attack of the clones,” NSDI, 2013.

[13] G. W. Wornell et al., “Efficient task replication for fast response times in parallel computation,” SIGMETRICS, 2014.

[14] A. G. Greenberg et al., “Reining in the outliers in Map-Reduce clusters using Mantri,” OSDI, 2010.

[15] P. Grover et al., “‘Short-Dot’: computing large linear transforms distributedly using coded short dot products,” IEEE Transactions on Information Theory, 2017.