Prophet: Speeding up Distributed DNN Training with Predictable Communication Scheduling

Optimizing the performance of Distributed Deep Neural Network (DDNN) training has become increasingly compelling as DNN models grow more complex and training datasets grow larger. Existing work on communication scheduling mostly focuses on overlapping computation and communication to improve DDNN training performance, yet GPU and network resources remain under-utilized in DDNN training clusters. To tackle this issue, we design and implement a predictable communication scheduling strategy named Prophet that schedules gradient transfers in an appropriate order, with the aim of maximizing GPU and network resource utilization. Leveraging the stepwise pattern we observe in gradient transfer start times, Prophet first uses the monitored network bandwidth and the profiled time intervals between gradients to predict how many gradients should be grouped into each block. These gradient blocks are then transferred one by one, keeping both the GPU and the network busy while preserving the priority of gradient transfers (i.e., low-priority gradients cannot preempt high-priority gradients on the network). In this way, Prophet allows forward propagation to start as early as possible, greedily reducing the waiting (idle) time of GPU resources during DDNN training. Prototype experiments with representative DNN models trained on Amazon EC2 demonstrate that Prophet improves DDNN training performance by up to 40% over state-of-the-art priority-based communication scheduling strategies, with negligible runtime overhead.
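
To make the block-grouping idea concrete, below is a minimal sketch of one way gradients could be merged into blocks from a monitored bandwidth estimate and profiled inter-gradient intervals, and then transferred in priority order. This is not the authors' implementation: the `Gradient` fields, the `group_into_blocks` grouping rule, and the example bandwidth and gap values are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gradient:
    layer_idx: int       # smaller index = needed earlier by the next forward pass = higher priority
    size_bytes: int      # size of the gradient tensor
    ready_gap_s: float   # profiled interval since the previous gradient became ready


def group_into_blocks(grads: List[Gradient], bandwidth_bps: float) -> List[List[Gradient]]:
    """Greedily merge consecutive gradients into one block while the block's
    estimated transfer time still fits within the profiled gap before the next
    gradient becomes ready, so the network stays busy and low-priority
    gradients never delay higher-priority ones."""
    blocks: List[List[Gradient]] = []
    current: List[Gradient] = []
    current_bytes = 0
    for i, g in enumerate(grads):
        current.append(g)
        current_bytes += g.size_bytes
        transfer_time = current_bytes / bandwidth_bps
        next_gap = grads[i + 1].ready_gap_s if i + 1 < len(grads) else float("inf")
        # Close the block once sending it would overrun the time budget
        # before the next (lower-priority) gradient is produced.
        if transfer_time >= next_gap:
            blocks.append(current)
            current, current_bytes = [], 0
    if current:
        blocks.append(current)
    return blocks


def transfer(blocks: List[List[Gradient]]) -> None:
    # Blocks are sent one by one; within a block, gradients are ordered by
    # layer index so higher-priority (front-layer) gradients go out first.
    for block in blocks:
        for g in sorted(block, key=lambda x: x.layer_idx):
            print(f"send layer {g.layer_idx} ({g.size_bytes} bytes)")


if __name__ == "__main__":
    # Hypothetical workload: 8 gradients of 4 MB each, 20 ms apart, on a ~10 Gbps link.
    grads = [Gradient(layer_idx=i, size_bytes=4_000_000, ready_gap_s=0.02) for i in range(8)]
    transfer(group_into_blocks(grads, bandwidth_bps=1.25e9))
```

In a real system, the bandwidth estimate and the inter-gradient gaps would come from online monitoring and per-iteration profiling, as the abstract describes, rather than from fixed constants.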
