CEFS: compute-efficient flow scheduling for iterative synchronous applications

Iterative Synchronous Applications (ISApps), exemplified by distributed deep learning (DL) training, are widespread in today's data centers. In an ISApp, multiple nodes carry out a computing task iteratively and globally synchronize their results in every iteration. To increase the scaling efficiency of ISApps, this paper proposes a new flow scheduling approach called CEFS. CEFS reduces the waiting time of computing nodes in two ways: within a single node, flows carrying data that can trigger earlier computation at that node are given higher priority; across nodes, flows destined for slower nodes are given higher priority. Realizing CEFS in real systems raises several challenges, such as the limited number of priority queues on commodity switches, the need to combine the two types of priorities, and adaptation to different applications and hardware environments. To address them, we design an online Bayesian-optimization-based priority assignment algorithm that satisfies a two-dimensional order-preserving rule. We implement a CEFS prototype and evaluate it both on a 16-node GPU/RoCEv2 testbed training typical DL models and in NS-3 simulations. Compared with TensorFlow and two representative scheduling solutions, TicTac and ByteScheduler, CEFS improves training throughput by up to 253%, 252%, and 47%, respectively. Moreover, the scaling efficiency of the 16-node system under TensorFlow, TicTac, ByteScheduler, and CEFS is 26.6%~46.9%, 26.7%~47.0%, 63.9%~80.3%, and 92.9%~94.7%, respectively. The NS-3 simulation results show that CEFS maintains similar scaling efficiency even at larger scales.
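As a rough illustration of the two-dimensional priority idea described above, the sketch below shows one way to fold the two priority dimensions (how early the receiving node needs a flow's data, and how far behind the destination node is) into the small number of priority queues available on commodity switches while preserving the order in both dimensions. This is a minimal sketch, not the paper's algorithm: CEFS tunes the mapping online with Bayesian optimization, whereas here a fixed weight alpha stands in for that tuned parameter, and all names (Flow, assign_queue, tensor_rank, node_lag_rank, NUM_QUEUES) are assumptions introduced for illustration only.

```python
"""Illustrative sketch only; not CEFS's actual priority assignment algorithm."""
from dataclasses import dataclass

NUM_QUEUES = 8  # commodity switches commonly expose around 8 priority queues


@dataclass
class Flow:
    tensor_rank: int    # 0 = data the receiving node needs earliest
    node_lag_rank: int  # 0 = flow heads to the slowest (most lagging) node


def assign_queue(flow: Flow, max_tensor_rank: int, max_lag_rank: int,
                 alpha: float = 0.5) -> int:
    """Map a flow to a queue index in [0, NUM_QUEUES); 0 is the highest priority.

    The combined score is monotone in both ranks, so if flow A is ranked no
    later than flow B in *both* dimensions, A is never placed in a
    lower-priority queue than B, which is the two-dimensional
    order-preserving property.
    """
    t = flow.tensor_rank / max(max_tensor_rank, 1)
    s = flow.node_lag_rank / max(max_lag_rank, 1)
    score = alpha * t + (1.0 - alpha) * s      # combined score in [0, 1]
    return min(int(score * NUM_QUEUES), NUM_QUEUES - 1)


if __name__ == "__main__":
    # Earliest-needed tensor heading to the slowest node: highest-priority queue.
    print(assign_queue(Flow(tensor_rank=0, node_lag_rank=0), 10, 15))    # -> 0
    # Last-needed tensor heading to the fastest node: lowest-priority queue.
    print(assign_queue(Flow(tensor_rank=10, node_lag_rank=15), 10, 15))  # -> 7
```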

[1] Alex X. Liu et al. Friends, not Foes – Synthesizing Existing Transport Strategies for Data Center Networks. 2014.

[2] Nick McKeown et al. pFabric: minimal near-optimal datacenter transport. SIGCOMM, 2013.

[3] Shuai Wang et al. Geryon: Accelerating Distributed CNN Training by Network-Level Flow Scheduling. IEEE INFOCOM, 2020.

[4] Joshua Romero et al. Exascale Deep Learning for Scientific Inverse Problems. arXiv, 2019.

[5] Yiming Zhang et al. Rate-aware flow scheduling for commodity data center networks. IEEE INFOCOM, 2017.

[6] Kai Chen et al. Towards Zero Copy Dataflows using RDMA. SIGCOMM Posters and Demos, 2017.

[7] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[8] Dong Yu et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. INTERSPEECH, 2014.

[9] Ion Stoica et al. Coflow: a networking abstraction for cluster applications. HotNets-XI, 2012.

[10] Amar Phanishayee et al. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training. SoCC, 2018.

[11] Shuai Wang et al. HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. NetAI@SIGCOMM, 2018.

[12] Ion Stoica et al. Efficient coflow scheduling with Varys. SIGCOMM, 2014.

[13] Tao Zhang et al. EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform. IEEE HPCA, 2020.

[14] Panos Kalnis et al. Scaling Distributed Machine Learning with In-Network Aggregation. NSDI, 2019.

[15] Dan Alistarh et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks. arXiv:1610.02132, 2016.

[16] Yibo Zhu et al. A generic communication scheduler for distributed DNN training acceleration. SOSP, 2019.

[17] James Demmel et al. ImageNet Training in Minutes. ICPP, 2017.

[18] Yiming Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS, 2019.

[19] Jorge Nocedal et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR, 2016.

[20] Dhabaleswar K. Panda et al. Accelerating TensorFlow with Adaptive RDMA-Based gRPC. IEEE HiPC, 2018.

[21] Panos Kalnis et al. In-Network Computation is a Dumb Idea Whose Time Has Come. HotNets, 2017.

[22] Leslie G. Valiant. A bridging model for parallel computation. CACM, 1990.

[23] Chuan Wu et al. Optimus: an efficient dynamic resource scheduler for deep learning clusters. EuroSys, 2018.

[24] Wei Zhang et al. Asynchronous Decentralized Parallel Stochastic Gradient Descent. ICML, 2017.

[25] Pengtao Xie et al. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. USENIX Annual Technical Conference, 2017.

[26] Antony I. T. Rowstron et al. Decentralized task-aware scheduling for data center networks. SIGCOMM, 2014.

[27] Shuchang Zhou et al. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv, 2016.

[28] Alexander J. Smola et al. Scaling Distributed Machine Learning with the Parameter Server. OSDI, 2014.

[29] Wencong Xiao et al. Gandiva: Introspective Cluster Scheduling for Deep Learning. OSDI, 2018.

[30] Sangeetha Abdu Jyothi et al. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling. MLSys, 2018.

[31] Wencong Xiao et al. Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications. 2018.

[32] Dan Li et al. Impact of Network Topology on the Performance of DML: Theoretical Analysis and Practical Factors. IEEE INFOCOM, 2019.

[33] Hiroaki Mikami et al. Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash. 2018.

[34] Gennady Pekhimenko et al. Priority-based Parameter Propagation for Distributed DNN Training. SysML, 2019.

[35] Ion Stoica et al. Efficient Coflow Scheduling Without Prior Knowledge. SIGCOMM, 2015.

[36] Kaiming He et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv, 2017.

[37] Yuanzhou Yang et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. arXiv, 2018.

[38] Hai Jin et al. Heterogeneity and Interference-Aware Virtual Machine Provisioning for Predictable Performance in the Cloud. IEEE Transactions on Computers, 2016.

[39] Zheng Zhang et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv, 2015.

[40] Jiawei Jiang et al. Heterogeneity-aware Distributed Parameter Servers. SIGMOD, 2017.

[41] Joseph Gonzalez et al. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent. arXiv, 2018.

[42] Yongqiang Xiong et al. Congestion Control for High-speed Extremely Shallow-buffered Datacenter Networks. APNet, 2017.

[43] Sergey Ioffe et al. Rethinking the Inception Architecture for Computer Vision. IEEE CVPR, 2016.

[44] Blaise Agüera y Arcas et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. AISTATS, 2016.

[45] Wei Bai et al. Information-Agnostic Flow Scheduling for Commodity Data Centers. NSDI, 2015.

[46] Bo Li et al. Fast Distributed Deep Learning via Worker-adaptive Batch Sizing. SoCC, 2018.

[47] Shaohuai Shi et al. MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms. IEEE INFOCOM, 2019.

[48] Kang G. Shin et al. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. NSDI, 2019.

[49] Haitao Wu et al. RDMA over Commodity Ethernet at Scale. SIGCOMM, 2016.

[50] Feng Liu et al. AuTO: scaling deep reinforcement learning for datacenter-scale automatic traffic optimization. SIGCOMM, 2018.

[51] Michael J. Freedman et al. SLAQ: quality-driven scheduling for distributed machine learning. SoCC, 2017.

[52] Jonas Mockus et al. Application of Bayesian approach to numerical methods of global and stochastic optimization. J. Glob. Optim., 1994.

[53] Natalia Gimelshein et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS, 2019.

[54] Krishnendu Chakrabarty et al. Lotus: A New Topology for Large-scale Distributed Machine Learning. ACM J. Emerg. Technol. Comput. Syst., 2020.