Compressed Coded Distributed Computing

Communication overhead is one of the major performance bottlenecks in large-scale distributed computing systems, particularly for machine learning applications. Conventionally, compression techniques reduce the communication load by combining intermediate results of the same computation task as much as possible. More recently, the development of coded distributed computing (CDC) has shown that coding across intermediate results of different computation tasks can further reduce the communication load. We propose a new scheme, named compressed coded distributed computing (compressed CDC for short), that jointly exploits these two techniques, i.e., combining intermediate results of the same computation and coding across intermediate results of different computations, to significantly reduce the communication load for computations whose final stage is a linear aggregation (reduction) of intermediate results. Such computations are prevalent in machine learning, e.g., distributed training algorithms in which partial gradients are computed in parallel and then averaged in the final stage. Specifically, compressed CDC first compresses/combines several intermediate results of a single computation, and then uses multiple such combined packets to create a coded multicast packet that is simultaneously useful for multiple computations. We characterize the achievable communication load of compressed CDC and show that it substantially outperforms both the combining-only scheme and the CDC scheme. Building on the compressed CDC technique, we then study a distributed training problem as one of its applications. We characterize the communication load of this distributed training problem and show that the proposed scheme is asymptotically optimal.
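
The sketch below illustrates the two steps described above, under simplifying assumptions: it works over real-valued vectors (summation stands in for the finite-field/XOR coding used in an actual CDC shuffle), and the function names combine and encode_multicast are hypothetical, not the paper's notation. It is meant only to show why one coded multicast packet can serve several computations at once.

import numpy as np

def combine(partial_results):
    """Step 1: compress intermediate results of the SAME computation.

    For linearly aggregated computations (e.g., partial gradients that are
    eventually averaged), locally available partial results can be summed
    into a single packet before any communication takes place.
    """
    return np.sum(partial_results, axis=0)

def encode_multicast(combined_packets):
    """Step 2: code ACROSS computations.

    Several combined packets, each belonging to a different computation and
    each wanted by a different node, are added into one coded packet that is
    multicast once but is useful to all of those nodes: each recipient
    subtracts the packets it can form from its own data to recover the one
    it is missing.
    """
    return np.sum(combined_packets, axis=0)

# Toy example: three computations, each intermediate result a length-4 vector.
rng = np.random.default_rng(0)
g = {j: [rng.standard_normal(4) for _ in range(2)] for j in range(3)}

# Step 1: per-computation combining at the sending node.
c = {j: combine(g[j]) for j in range(3)}

# Step 2: one multicast packet covering computations 0, 1, and 2.
coded = encode_multicast([c[0], c[1], c[2]])

# A node that can reconstruct c[1] and c[2] from its own data decodes c[0]:
decoded_c0 = coded - c[1] - c[2]
assert np.allclose(decoded_c0, c[0])

In this toy setting, a single transmission delivers a missing combined packet to each of several nodes, which is the source of the communication savings over sending each combined packet separately.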
