Hierarchical Coding for Distributed Computing

Coding for distributed computing supports low-latency computation by relieving the burden imposed by straggling workers. While most existing works assume a simple master-worker model, we consider a hierarchical computational structure consisting of groups of workers, motivated by the need to reflect the architectures of real-world distributed computing systems. In this work, we propose a hierarchical coding scheme for this model and analyze its decoding cost and expected computation time. Specifically, we first provide upper and lower bounds on the expected computation time of the proposed scheme. We also show that our scheme enables efficient parallel decoding, thus reducing decoding costs by orders of magnitude over non-hierarchical schemes. When both decoding cost and computation time are considered, the proposed hierarchical coding scheme is shown to outperform existing schemes in many practical scenarios.
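To make the idea concrete, below is a minimal NumPy sketch of hierarchical coded matrix-vector multiplication. It is an illustrative assumption rather than the paper's exact construction: each group of workers encodes its share of the rows of A with a real-valued Vandermonde (MDS-like) code, so any k of the group's n workers suffice, and each group is decoded independently, which is what allows the decoding step to run in parallel across groups. All parameter names (G, n, k, b, d) and helper functions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: G groups, n workers per group, recovery threshold k,
# b rows per coded block, d columns of A.
G, n, k, b, d = 4, 6, 4, 5, 8

A = rng.standard_normal((G * k * b, d))
x = rng.standard_normal(d)
evals = np.arange(1.0, n + 1.0)  # distinct evaluation points within a group

def encode_group(A_group):
    """Encode one group's k row-blocks into n coded blocks (Vandermonde MDS-like code)."""
    blocks = A_group.reshape(k, b, d)
    V = np.vander(evals, k, increasing=True)       # (n, k) generator matrix
    return np.einsum('ni,ibd->nbd', V, blocks)     # n coded blocks, each (b, d)

def decode_group(worker_ids, partial_results):
    """Recover block_i @ x for all i from any k workers' coded partial products."""
    V_sub = np.vander(evals[worker_ids], k, increasing=True)  # (k, k), invertible
    stacked = np.stack(partial_results)                        # (k, b)
    blocks_times_x = np.linalg.solve(V_sub, stacked)           # (k, b)
    return blocks_times_x.reshape(k * b)

y = np.empty(G * k * b)
for g in range(G):
    coded = encode_group(A[g * k * b:(g + 1) * k * b])
    worker_out = coded @ x                             # each worker's coded result, (n, b)
    survivors = rng.choice(n, size=k, replace=False)   # simulate the k fastest workers
    # Groups are independent, so this decode can run in parallel across g.
    y[g * k * b:(g + 1) * k * b] = decode_group(survivors, worker_out[survivors])

assert np.allclose(y, A @ x)
```

Because each group only requires inverting a small k x k system rather than one large system over all workers, the per-group decoding work is small and the groups' decodes can be dispatched concurrently, which is the source of the decoding-cost reduction described above.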
