AutoByte: Automatic Configuration for Optimal Communication Scheduling in DNN Training

ByteScheduler partitions and rearranges tensor transmissions to improve the communication efficiency of distributed Deep Neural Network (DNN) training. The configuration of its hyper-parameters (i.e., the partition size and the credit size) is critical to the effectiveness of partitioning and rearrangement. Currently, ByteScheduler uses Bayesian Optimization (BO) to find the optimal hyper-parameter configuration beforehand. In practice, however, various runtime factors (e.g., worker node status and network conditions) change over time, so a statically determined, one-shot configuration can be suboptimal for real-world DNN training. To address this problem, we present AutoByte, a real-time configuration method that automatically and promptly searches for the optimal hyper-parameters as the training system changes dynamically. AutoByte extends the ByteScheduler framework with a meta-network that takes the system's runtime statistics as input and predicts the speedup under specific configurations. Evaluation results on various DNN models show that AutoByte can dynamically tune the hyper-parameters with low resource usage and deliver up to 33.2% higher performance than the best static configuration in ByteScheduler.
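The meta-network described above can be read as a small regression model that maps runtime statistics plus a candidate (partition size, credit size) pair to a predicted speedup, with the best-scoring candidate applied at runtime. The sketch below is a minimal illustration under that reading, not the authors' implementation; names such as MetaNet and pick_config, the choice of statistics, and the candidate values are all hypothetical.

```python
# Minimal sketch of the meta-network idea (hypothetical names and shapes;
# not AutoByte's actual implementation).
import itertools
import torch
import torch.nn as nn


class MetaNet(nn.Module):
    """Predicts a training speedup from runtime statistics + a candidate config."""

    def __init__(self, num_stats: int, hidden: int = 64):
        super().__init__()
        # Input: runtime statistics (e.g., bandwidth, GPU utilization, queueing
        # delay) concatenated with the two hyper-parameters.
        self.net = nn.Sequential(
            nn.Linear(num_stats + 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted speedup (scalar)
        )

    def forward(self, stats: torch.Tensor, config: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([stats, config], dim=-1)).squeeze(-1)


def pick_config(model: MetaNet, stats: torch.Tensor, partition_sizes, credit_sizes):
    """Score every candidate (partition, credit) pair and return the best one."""
    candidates = list(itertools.product(partition_sizes, credit_sizes))
    cfg = torch.tensor(candidates, dtype=torch.float32)       # (C, 2)
    batch_stats = stats.expand(len(candidates), -1)            # (C, num_stats)
    with torch.no_grad():
        scores = model(batch_stats, cfg)                       # (C,)
    best = int(scores.argmax())
    return candidates[best], float(scores[best])


if __name__ == "__main__":
    model = MetaNet(num_stats=6)
    runtime_stats = torch.rand(1, 6)          # placeholder runtime statistics
    partition_sizes = [2**20, 2**21, 2**22]   # bytes (illustrative values)
    credit_sizes = [2, 4, 8]                  # illustrative credit values
    (part, credit), score = pick_config(model, runtime_stats,
                                        partition_sizes, credit_sizes)
    print(f"chosen partition={part}, credit={credit}, predicted speedup={score:.3f}")
```

In this framing, the scheduler would periodically collect fresh statistics, re-run pick_config, and hand the chosen (partition size, credit size) pair back to ByteScheduler, rather than relying on a single BO search performed before training starts.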
