Straggler-Resistant Distributed Matrix Computation via Coding Theory: Removing a Bottleneck in Large-Scale Data Processing

The current big data era routinely requires processing large-scale data on massive distributed computing clusters. In these applications, data sets are often so large that they cannot be housed in the memory and/or the disk of any single computer, so both the data and the processing must be spread across multiple nodes; distributed computation is a necessity rather than a luxury. The widespread use of such clusters presents several opportunities and advantages over traditional computing paradigms. However, it also presents new challenges, where coding-theoretic ideas have recently had a significant impact. Large-scale clusters (which can be heterogeneous in nature) suffer from the problem of stragglers: worker nodes that are slow or have failed. In the absence of a sophisticated assignment of tasks to the worker nodes, the overall speed of a computation is therefore typically dominated by its slowest node. Coding-theoretic schemes address this by distributing redundant, encoded subtasks so that the result can be recovered from any sufficiently large subset of workers, without waiting for the stragglers.
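To make the idea concrete, below is a minimal sketch of an MDS-coded distributed matrix-vector multiplication in the style of polynomial codes. The setup, function names (`encode`, `decode`), and evaluation points are our own illustrative choices, not a prescribed implementation: the matrix is split into k row blocks, each of n workers receives a polynomial evaluation of the blocks, and the full product is recovered from any k worker results by inverting a Vandermonde system.

```python
import numpy as np

def encode(A, k, eval_points):
    """Split A into k row blocks A_0..A_{k-1} and give each worker
    the polynomial evaluation sum_j x**j * A_j at its point x."""
    blocks = np.split(A, k, axis=0)
    return [sum(x**j * blocks[j] for j in range(k)) for x in eval_points]

def decode(results, points, k):
    """Recover all k block products A_j b from any k finished workers
    by solving a k x k Vandermonde system V c = r."""
    V = np.vander(np.array(points), k, increasing=True)  # V[i, j] = x_i**j
    coeffs = np.linalg.solve(V, np.stack(results))       # row j is A_j b
    return np.concatenate(list(coeffs))

# Toy example: 4 workers, any 2 suffice, so 2 stragglers are tolerated.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(4)
k, points = 2, [1.0, 2.0, 3.0, 4.0]
coded = encode(A, k, points)

# Each worker i would compute coded[i] @ b; suppose only workers 1 and 3
# finish. Their results alone determine A @ b.
done = [1, 3]
results = [coded[i] @ b for i in done]
y = decode(results, [points[i] for i in done], k)
assert np.allclose(y, A @ b)
```

In this sketch the recovery threshold is k = 2: any 2 of the 4 workers suffice, whereas an uncoded split into 4 blocks would require all 4 to respond. Real-valued Vandermonde systems become ill-conditioned as the number of workers grows, which is precisely the numerical-stability issue that more refined embeddings (e.g., circulant/rotation-matrix or convolutional constructions) are designed to mitigate.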
