Priority-based parameter propagation for distributed deep neural network training

Data-parallel training is commonly used to scale distributed Deep Neural Network (DNN) training. However, its performance benefits are often limited by the communication-heavy parameter synchronization step. In this work, we take advantage of domain-specific knowledge of DNN training and overlap parameter synchronization with computation to improve training performance. We make two key observations: (1) the optimal data representation granularity for communication may differ from that used by the underlying DNN model implementation, and (2) different parameters can afford different synchronization delays. Based on these observations, we propose a new synchronization mechanism called Priority-based Parameter Propagation (P3). P3 synchronizes parameters at a finer granularity and schedules data transmission so that the training process incurs minimal communication delay. We show that P3 can improve the training throughput of ResNet-50, Sockeye and VGG-19 by as much as 25%, 38% and 66% respectively on clusters with realistic network bandwidth.
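
Below is a minimal sketch (not the authors' implementation) of the scheduling idea described above, assuming gradients are split into fixed-size slices and that slices from layers closer to the input are given higher priority, since their updated parameters are needed earliest in the next iteration's forward pass. The names Slice, schedule_slices and SLICE_SIZE are hypothetical, and the sketch omits the interleaving with backpropagation that a real system would perform.

```python
# Illustrative sketch of priority-based parameter synchronization:
# per-layer gradients are split into fixed-size slices and transmitted
# in priority order, with earlier (input-side) layers drained first.
import heapq
from dataclasses import dataclass, field

import numpy as np

SLICE_SIZE = 4096  # elements per communication slice (assumed granularity)


@dataclass(order=True)
class Slice:
    priority: int                          # lower value = sent earlier
    layer: int = field(compare=False)      # index of the originating layer
    data: np.ndarray = field(compare=False)


def schedule_slices(gradients):
    """Split per-layer gradients into slices and yield them so that
    layers nearest the input (lowest index) are transmitted first."""
    queue = []
    for layer_idx, grad in enumerate(gradients):
        flat = grad.ravel()
        for start in range(0, flat.size, SLICE_SIZE):
            chunk = flat[start:start + SLICE_SIZE]
            # Priority key: layer index, so layer 0's slices drain first.
            heapq.heappush(queue, Slice(layer_idx, layer_idx, chunk))
    while queue:
        yield heapq.heappop(queue)


if __name__ == "__main__":
    # Toy example: three "layers" with gradients of different sizes.
    grads = [np.random.randn(10_000), np.random.randn(5_000), np.random.randn(20_000)]
    for s in schedule_slices(grads):
        print(f"sending slice of layer {s.layer} ({s.data.size} elements)")
```

In practice gradients become available back-to-front during backpropagation, so a real scheduler would enqueue slices as they arrive and preempt lower-priority transfers rather than sorting a complete queue up front; the sketch only illustrates the ordering criterion.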
