Falcon: Addressing Stragglers in Heterogeneous Parameter Server Via Multiple Parallelism

The parameter server architecture has shown promising performance advantages when handling deep learning (DL) applications. One crucial issue in this regard is the presence of stragglers, which significantly retards DL training progress. As evidenced by our experiments, previous straggler-mitigation solutions may not fully exploit the computation resources of the cluster, especially in heterogeneous environments. This motivates us to design a heterogeneity-aware parameter server paradigm that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and reveal practical guidelines for solving this problem in two aspects: (1) controlling each worker's training speed via elastic training parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully utilize the computation resources. Following these guidelines, we propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented into a prototype called Falcon, which effectively accelerates DL training in the presence of stragglers. Evaluation under various benchmarks against baseline systems demonstrates the superiority of our approach. Specifically, Falcon reduces the training convergence time by up to 61.83, 55.19, 38.92, and 23.68 percent compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
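To make the two guidelines more concrete, the minimal Python sketch below illustrates one possible reading of them: per-worker parallelism is rebalanced according to measured speed, and a slack-bounded barrier lets fast workers (pioneers) run a limited number of iterations ahead of stragglers. The proportional-rebalancing rule and all names (Worker, rebalance, may_proceed, slack) are illustrative assumptions for exposition, not the paper's actual EPSP implementation.

    # Illustrative sketch only: per-worker parallelism rebalancing plus a
    # slack-bounded synchronization check, in the spirit of the EPSP
    # description above. The rebalancing rule and names are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Worker:
        wid: int
        time_per_unit: float   # measured seconds per unit of parallelism
        parallelism: int = 1   # units (e.g., mini-batch shards) assigned this round
        iteration: int = 0     # local iteration clock

    def rebalance(workers, total_units):
        """Assign parallelism inversely proportional to each worker's per-unit
        time, so all workers finish an iteration in roughly the same wall time."""
        inv = [1.0 / w.time_per_unit for w in workers]
        scale = total_units / sum(inv)
        for w, s in zip(workers, inv):
            w.parallelism = max(1, round(s * scale))

    def may_proceed(worker, workers, slack):
        """Slack-synchronization: a worker may run ahead of the slowest worker
        by at most `slack` iterations (slack = 0 degenerates to enforced sync)."""
        slowest = min(w.iteration for w in workers)
        return worker.iteration - slowest <= slack

    if __name__ == "__main__":
        cluster = [Worker(0, 0.5), Worker(1, 1.0), Worker(2, 2.0)]  # heterogeneous speeds
        rebalance(cluster, total_units=12)
        print([w.parallelism for w in cluster])            # faster workers receive more units
        cluster[0].iteration = 3                           # pioneer ran ahead
        print(may_proceed(cluster[0], cluster, slack=2))   # False: must wait for stragglers

Under this reading, setting slack to zero recovers an enforced-synchronization scheme, while a positive slack trades bounded staleness for higher utilization; how Falcon actually chooses and enforces these quantities is detailed in the paper itself.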
