Random Convolutional Coding for Robust and Straggler Resilient Distributed Matrix Computation

Distributed matrix computations (matrix-vector and matrix-matrix multiplications) are at the heart of several tasks within the machine learning pipeline. Distributed clusters, however, are well known to suffer from stragglers (slow or failed nodes). Prior work in this area has presented straggler mitigation strategies based on polynomial evaluation and interpolation, but such approaches suffer from numerical problems (blow-up of round-off errors) owing to the high condition numbers of the corresponding Vandermonde matrices. In this work, we introduce a novel approach that embeds distributed matrix computations into the structure of a convolutional code. This simple innovation allows us to develop a provably numerically robust and efficient (fast) solution for distributed matrix-vector and matrix-matrix multiplication.
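To make the conditioning issue concrete, the following minimal sketch computes the condition number of a square real Vandermonde matrix as its size grows. It is an illustration only, not the scheme proposed in this work: the function name `vandermonde_condition_number` and the choice of equally spaced evaluation points in [-1, 1] are assumptions made for the example.

```python
import numpy as np


def vandermonde_condition_number(n: int) -> float:
    """Return the 2-norm condition number of an n x n real Vandermonde matrix.

    Assumption for illustration: the n evaluation points are equally
    spaced in [-1, 1]. Other real point sets behave similarly in that
    the condition number grows rapidly with n.
    """
    points = np.linspace(-1.0, 1.0, n)
    # Row i of V is [1, x_i, x_i^2, ..., x_i^(n-1)].
    V = np.vander(points, increasing=True)
    return float(np.linalg.cond(V))


if __name__ == "__main__":
    # Print the condition number for a few matrix sizes.
    for n in (5, 10, 20):
        print(n, vandermonde_condition_number(n))
```

A decoder that interpolates a polynomial from worker outputs effectively inverts such a matrix, so a larger condition number means that round-off errors in the received results can be amplified more strongly in the recovered product.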
