Grouper: Accelerating Hyperparameter Searching in Deep Learning Clusters With Network Scheduling

Training a high-accuracy model requires trying hundreds of hyperparameter configurations to find the optimal one. It is common to launch a group of training jobs (called a cojob) with different configurations at the same time and to stop the worst-performing jobs at the end of every stage (i.e., a fixed number of iterations). Accelerating the search therefore means minimizing the stage completion time (SCT). To complete stages quickly, each job in a cojob typically uses multiple GPUs for distributed training, and the GPUs exchange data over the network in every iteration to synchronize their models. However, because a GPU cluster hosts many cojobs from different users, the data transfers of DL jobs compete for network bandwidth, causing congestion and consequently large SCTs for cojobs. Existing flow schedulers, which aim to reduce flow/coflow/job completion time, do not match the requirements of hyperparameter searching. In this paper, we design and implement Grouper, a system that minimizes the average SCT of cojobs. Grouper uses a carefully designed algorithm to compute a permutation of the cojobs' stages and schedules flows from different stages in the order of that permutation. Extensive testbed experiments and simulations show that Grouper outperforms advanced network designs, including Baraat, Sincronia, and per-flow fair sharing.
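To make the abstract's mechanism concrete, below is a minimal sketch of the scheduling idea in Python. Everything here is an assumption for illustration: the `Stage` class, the single bottleneck link of fixed capacity, and the shortest-total-demand-first permutation rule (a classic heuristic for minimizing average completion time) stand in for Grouper's actual permutation algorithm, which the abstract does not specify; `fair_share_scts` is likewise an idealized model of per-flow fair sharing, not the paper's baseline implementation.

```python
"""Sketch only: why serving cojob stages in a good permutation order
can beat per-flow fair share on average stage completion time (SCT).
Assumed, not from the paper: one bottleneck link, stage demand = sum of
its flows' sizes, and a shortest-total-demand-first permutation rule."""
from dataclasses import dataclass


@dataclass
class Stage:
    """One stage of a cojob: a block of iterations whose gradient-
    synchronization flows must all finish before the cojob can compare
    its jobs and stop the worst performers."""
    cojob_id: str
    flow_sizes_mb: list[float]  # sizes of the stage's network transfers

    @property
    def demand_mbit(self) -> float:
        return sum(self.flow_sizes_mb) * 8  # total demand in megabits


def permutation_scts(stages, link_mbps=1000.0):
    """Serve stages one at a time, smallest total demand first, and
    return each cojob's SCT in seconds."""
    t, scts = 0.0, {}
    for s in sorted(stages, key=lambda s: s.demand_mbit):
        t += s.demand_mbit / link_mbps
        scts[s.cojob_id] = t
    return scts


def fair_share_scts(stages, link_mbps=1000.0):
    """Baseline: all unfinished stages split the bottleneck link
    equally (an idealized per-flow fair share)."""
    remaining = {s.cojob_id: s.demand_mbit for s in stages}
    t, scts = 0.0, {}
    while remaining:
        n = len(remaining)
        sid = min(remaining, key=remaining.get)
        dt = remaining[sid] * n / link_mbps  # smallest stage finishes next
        served = dt * link_mbps / n          # megabits served to each stage
        t += dt
        for k in remaining:
            remaining[k] -= served
        scts[sid] = t
        del remaining[sid]
    return scts


if __name__ == "__main__":
    stages = [
        Stage("cojob-A", [50.0] * 4),    # 4 workers, 50 MB each
        Stage("cojob-B", [10.0] * 2),
        Stage("cojob-C", [200.0] * 3),
    ]
    for name, scts in [("permutation", permutation_scts(stages)),
                       ("fair share", fair_share_scts(stages))]:
        avg = sum(scts.values()) / len(scts)
        print(f"{name:11s} avg SCT = {avg:.2f}s  {scts}")
```

On this toy workload, serving stages in permutation order yields a noticeably lower average SCT than fair sharing, which illustrates the abstract's point: deciding the order in which stages get the network, rather than splitting bandwidth among all of them, is what shortens the average stage and thereby accelerates hyperparameter searching.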

[1] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.

[2] Beng Chin Ooi, et al. Rafiki: Machine Learning as an Analytics Service System, 2018, Proc. VLDB Endow.

[3] Kang G. Shin, et al. Tiresias: A GPU Cluster Manager for Distributed Deep Learning, 2019, NSDI.

[4] Bertrand M. T. Lin, et al. Parallel dedicated machine scheduling with conflict graphs, 2018, Comput. Ind. Eng.

[5] Chen Tian, et al. Scheduling Coflows of Multi-stage Jobs to Minimize the Total Weighted Job Completion Time, 2018, IEEE INFOCOM 2018 - IEEE Conference on Computer Communications.

[6] Michael J. Freedman, et al. SLAQ: quality-driven scheduling for distributed machine learning, 2017, SoCC.

[7] Lars Kotthoff, et al. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA, 2017, J. Mach. Learn. Res.

[8] Andreas S. Schulz. Scheduling to Minimize Total Weighted Completion Time: Performance Guarantees of LP-Based Heuristics and Lower Bounds, 1996, IPCO.

[9] Wei Bai, et al. Information-Agnostic Flow Scheduling for Commodity Data Centers, 2015, NSDI.

[10] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[11] Enhong Chen, et al. One more queue is enough: Minimizing flow completion time with explicit priority notification, 2017, IEEE INFOCOM 2017 - IEEE Conference on Computer Communications.

[12] Hong Liu, et al. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network, 2015, Comput. Commun. Rev.

[13] Ion Stoica, et al. Efficient coflow scheduling with Varys, 2014, SIGCOMM.

[14] Fred Baker, et al. Configuration Guidelines for DiffServ Service Classes, 2006, RFC.

[15] Pengtao Xie, et al. Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters, 2017, USENIX Annual Technical Conference.

[16] Matthieu Cord, et al. GoSGD: Distributed Optimization for Deep Learning with Gossip Exchange, 2018, Neurocomputing.

[17] Antony I. T. Rowstron, et al. Decentralized task-aware scheduling for data center networks, 2014, SIGCOMM.

[18] Thomas R. Henderson, et al. Network Simulations with the ns-3 Simulator, 2008.

[19] Olatunji Ruwase, et al. HyperDrive: exploring hyperparameters with POP scheduling, 2017, Middleware.

[20] Michel X. Goemans, et al. Improved approximation algorithms for scheduling with release dates, 1997, SODA '97.

[21] David M. Brooks, et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective, 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[22] Samir Khuller, et al. Select and Permute: An Improved Online Framework for Scheduling to Minimize Weighted Completion Time, 2017, LATIN.

[23] Alexander J. Smola, et al. Scaling Distributed Machine Learning with the Parameter Server, 2014, OSDI.

[24] Chuan Wu, et al. Deep Learning-based Job Placement in Distributed Machine Learning Clusters, 2019, IEEE INFOCOM 2019 - IEEE Conference on Computer Communications.

[25] Wencong Xiao, et al. Gandiva: Introspective Cluster Scheduling for Deep Learning, 2018, OSDI.

[26] Wencong Xiao, et al. Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications, 2018.

[27] Sheng Wang, et al. Towards Practical and Near-Optimal Coflow Scheduling for Data Center Networks, 2016, IEEE Transactions on Parallel and Distributed Systems.

[28] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[29] David B. Shmoys, et al. Scheduling to Minimize Average Completion Time: Off-Line and On-Line Approximation Algorithms, 1997, Math. Oper. Res.

[30] Nick McKeown, et al. A Distributed Algorithm to Calculate Max-Min Fair Rates Without Per-Flow State, 2019, Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems.

[31] Amin Vahdat, et al. Sincronia: near-optimal network design for coflows, 2018, SIGCOMM.

[32] Chuan Wu, et al. Optimus: an efficient dynamic resource scheduler for deep learning clusters, 2018, EuroSys.

[33] Khaled A. Harras, et al. Eiffel: Efficient and Flexible Software Packet Scheduling, 2018, NSDI.

[34] Jianping Wu, et al. Joint optimization of tasks placement and routing to minimize Coflow Completion Time, 2019, J. Netw. Comput. Appl.

[35] Maurice Queyranne. Structure of a simple scheduling polyhedron, 1993, Math. Program.

[36] Wencong Xiao, et al. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads, 2019, USENIX Annual Technical Conference.

[37] Ke Li, et al. Efficient File Dissemination in Data Center Networks With Priority-Based Adaptive Multicast, 2020, IEEE Journal on Selected Areas in Communications.

[38] Amin Vahdat, et al. A scalable, commodity data center network architecture, 2008, SIGCOMM '08.

[39] Samir Khuller, et al. On Scheduling Coflows, 2020, Algorithmica.

[40] Yoshua Bengio, et al. Random Search for Hyper-Parameter Optimization, 2012, J. Mach. Learn. Res.

[41] Lars Kotthoff, et al. Automated Machine Learning: Methods, Systems, Challenges, 2019, The Springer Series on Challenges in Machine Learning.

[42] Xingang Shi, et al. Efficient Scheduling of Weighted Coflows in Data Centers, 2019, IEEE Transactions on Parallel and Distributed Systems.

[43] Ameet Talwalkar, et al. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization, 2016, J. Mach. Learn. Res.

[44] Hans Kellerer, et al. Parallel dedicated machines scheduling with chain precedence constraints, 2012, Eur. J. Oper. Res.

[45] Amit Kumar, et al. Order Scheduling Models: Hardness and Algorithms, 2007, FSTTCS.

[46] Ola Svensson, et al. Minimizing the sum of weighted completion times in a concurrent open shop, 2010, Oper. Res. Lett.

[47] Alessandro Agnetis, et al. Scheduling three chains on two parallel machines, 2010, Eur. J. Oper. Res.

[48] Ameet Talwalkar, et al. Non-stochastic Best Arm Identification and Hyperparameter Optimization, 2015, AISTATS.