Elastic parameter server load distribution in deep learning clusters

In distributed DNN training, parameter servers (PSs) can become performance bottlenecks due to PS stragglers, which arise from imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (a.k.a. load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and substantial potential for training acceleration with better PS load distribution. We design PSLD, a dynamic parameter server load distribution scheme, to mitigate PS straggler issues and accelerate distributed model training in the PS architecture. A carefully designed exploitation-exploration method scales parameter servers in and out and adjusts the parameter distribution among PSs on the fly. We also design an elastic PS scaling module to carry out our scheme with little interruption to the training process. We implement our module on top of open-source PS architectures, including MXNet and BytePS. Testbed experiments show up to 2.86x training speed-up with PSLD for different ML models under various straggler settings.
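
As a rough illustration of the exploitation-exploration idea described above, the following minimal sketch (not the authors' implementation; all names such as psld_step, rebalance, and ps_speed are hypothetical) shows a per-interval control step that, with small probability, explores a different number of parameter servers, and otherwise exploits measured per-PS speeds to re-shard parameters so that straggling servers receive proportionally less load:

```python
import random

def rebalance(param_sizes, ps_speed):
    """Greedy re-sharding (hypothetical): assign the largest tensors first
    to the PS with the smallest normalized load (bytes / measured speed),
    so slower (straggling) servers receive proportionally less traffic."""
    loads = {s: 0.0 for s in ps_speed}
    assignment = {}
    for name, size in sorted(param_sizes.items(), key=lambda kv: -kv[1]):
        target = min(ps_speed, key=lambda s: loads[s] / ps_speed[s])
        assignment[name] = target
        loads[target] += size
    return assignment

def psld_step(param_sizes, ps_speed, epsilon=0.1, max_servers=8):
    """One control decision per monitoring interval: with probability
    epsilon, explore by scaling the PS set out or in by one instance;
    otherwise exploit the current speed estimates and only re-shard."""
    if random.random() < epsilon:
        if len(ps_speed) < max_servers and random.random() < 0.5:
            # Scale out: add a PS with an optimistic speed prior.
            ps_speed[max(ps_speed) + 1] = max(ps_speed.values())
        elif len(ps_speed) > 1:
            # Scale in: drop the slowest PS.
            ps_speed.pop(min(ps_speed, key=ps_speed.get))
    return rebalance(param_sizes, ps_speed)

# Example: three PSs, one straggler (PS 2 runs at half speed), so the
# largest tensors land on the two fast servers.
sizes = {"fc1.weight": 400.0, "fc2.weight": 200.0, "conv1.weight": 50.0}
speeds = {0: 1.0, 1: 1.0, 2: 0.5}
print(psld_step(sizes, speeds, epsilon=0.0))
```

In a real deployment, such a re-sharding decision would have to be coordinated with an elastic scaling module so that parameters migrate between PSs with minimal interruption to training.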

[1] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.

[2] Yibo Zhu, et al. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters, 2020, OSDI.

[3] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Alexandros G. Dimakis, et al. Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017, ICML.

[5] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, ArXiv.

[6] Gregory R. Ganger, et al. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets, 2017, EuroSys.

[7] Alexander J. Smola, et al. Scaling Distributed Machine Learning with the Parameter Server, 2014, OSDI.

[8] Richard G. Baraniuk, et al. Connection-level analysis and modeling of network traffic, 2001, IMW '01.

[9] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[10] Bo Li, et al. Round-Robin Synchronization: Mitigating Communication Bottlenecks in Parameter Servers, 2019, IEEE INFOCOM 2019 - IEEE Conference on Computer Communications.

[11] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[12] Jiawei Jiang, et al. Heterogeneity-aware Distributed Parameter Servers, 2017, SIGMOD Conference.

[13] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Eric P. Xing, et al. Addressing the straggler problem for iterative convergent parallel ML, 2016, SoCC.

[15] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.

[16] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.

[17] Michael J. Freedman, et al. Resource Elasticity in Distributed Deep Learning, 2020, MLSys.

[18] Shengen Yan, et al. Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach, 2017, 2017 IEEE International Conference on Smart Computing (SMARTCOMP).

[19] Geoffrey E. Hinton, et al. Speech recognition with deep recurrent neural networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20] G. Newman, et al. Confidence Intervals, 1987, The Lancet.

[21] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[22] Chuan Wu, et al. Optimus: an efficient dynamic resource scheduler for deep learning clusters, 2018, EuroSys.

[23] Wei Zhang, et al. Asynchronous Decentralized Parallel Stochastic Gradient Descent, 2017, ICML.

[24] Wei Lin, et al. DL2: A Deep Learning-Driven Scheduler for Deep Learning Clusters, 2019, IEEE Transactions on Parallel and Distributed Systems.

[25] Xin Yuan, et al. Bandwidth Efficient All-reduce Operation on Tree Topologies, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[26] Shuai Wang, et al. Accelerating Distributed Machine Learning by Smart Parameter Server, 2019, APNet.

[27] H. Robbins. A Stochastic Approximation Method, 1951.

[28] Yibo Zhu, et al. A generic communication scheduler for distributed DNN training acceleration, 2019, SOSP.

[29] Zongpeng Li, et al. Online Job Scheduling in Distributed Machine Learning Clusters, 2018, IEEE INFOCOM 2018 - IEEE Conference on Computer Communications.

[30] Chuan Wu, et al. Deep Learning-based Job Placement in Distributed Machine Learning Clusters, 2019, IEEE INFOCOM 2019 - IEEE Conference on Computer Communications.

[31] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[32] Scott Shenker, et al. Effective Straggler Mitigation: Attack of the Clones, 2013, NSDI '13 (10th USENIX Symposium on Networked Systems Design and Implementation).

[33] Xin Yuan, et al. Bandwidth optimal all-reduce algorithms for clusters of workstations, 2009, J. Parallel Distributed Comput.

[34] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[35] Abutalib Aghayev, et al. Litz: Elastic Framework for High-Performance Distributed Machine Learning, 2018, USENIX Annual Technical Conference.

[36] Aijun An, et al. Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning, 2019, 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).

[37] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[38] Xuehai Qian, et al. Hop: Heterogeneity-aware Decentralized Training, 2019, ASPLOS.

[39] Chuan Wu, et al. Preemptive All-reduce Scheduling for Expediting Distributed DNN Training, 2020, IEEE INFOCOM 2020 - IEEE Conference on Computer Communications.

[40] Arthur Charguéraud, et al. Scheduling parallel programs by work stealing with private deques, 2013, PPoPP '13.

[41] Yaoliang Yu, et al. Petuum: A New Platform for Distributed Machine Learning on Big Data, 2013, IEEE Transactions on Big Data.