Accelerating Distributed Training in Heterogeneous Clusters via a Straggler-Aware Parameter Server

Unlike homogeneous clusters, heterogeneous clusters suffer severe performance degradation in distributed training because of stragglers. Instead of the synchronous stochastic optimization commonly used in homogeneous clusters, we adopt an asynchronous approach, which avoids waiting for stragglers but introduces the problem of stale parameters. To address this problem, we design a straggler-aware parameter server (SaPS), which detects stragglers through parameter versions and mitigates their effect with a coordinator that bounds parameter staleness without waiting for stragglers. Experimental results show that SaPS converges faster than fully synchronous, fully asynchronous, and several SGD variants.
