BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes

Supercomputer first-come, first-served (FCFS) scheduling policies leave many nodes transiently idle, a phenomenon that backfill scheduling, which promotes small jobs to run ahead of large jobs, only partially alleviates. Here we describe how to realize a novel use for these otherwise wasted resources: deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node×time hole in a supercomputer’s schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP) resource allocation problem, and show that this MILP problem can be solved efficiently at run time. We further show how the MILP formulation can be adapted to optimize for administrator- or user-defined metrics. We validate our method with supercomputer scheduler logs and different DNN training scenarios, and demonstrate efficiencies of up to 93% compared with running the same training tasks on dedicated nodes. Our method thus enables substantial supercomputer resources to be allocated to DNN training with no impact on other applications.
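To make the MILP idea concrete, the following is a minimal sketch (not the paper's actual formulation) of how idle backfill nodes might be assigned to rescalable DNN training jobs. It assumes each job has minimum and maximum node counts and a per-node throughput estimate, and it maximizes aggregate throughput subject to the capacity of each idle "hole." The names `holes`, `jobs`, `min_nodes`, `max_nodes`, and `throughput` are illustrative assumptions, not identifiers from BFTrainer; the open-source PuLP package and its bundled CBC solver stand in for whatever solver the system uses.

```python
# Sketch: assign nodes from idle backfill holes to rescalable DNN training jobs
# via a small MILP. Illustrative only; not the paper's exact model.
import pulp

holes = {"h1": 4, "h2": 16}            # idle hole -> available node count
jobs = {
    "jobA": {"min_nodes": 2, "max_nodes": 8,  "throughput": 1.0},
    "jobB": {"min_nodes": 4, "max_nodes": 32, "throughput": 0.8},
}

prob = pulp.LpProblem("backfill_dnn_allocation", pulp.LpMaximize)

# Integer count of nodes assigned from each hole to each job.
x = {(h, j): pulp.LpVariable(f"x_{h}_{j}", lowBound=0, cat="Integer")
     for h in holes for j in jobs}
# Binary indicator: job j runs at all (activates its min/max scaling bounds).
y = {j: pulp.LpVariable(f"y_{j}", cat="Binary") for j in jobs}

# Objective: maximize aggregate training throughput of the assigned nodes.
prob += pulp.lpSum(jobs[j]["throughput"] * x[h, j] for h in holes for j in jobs)

# Capacity: do not exceed the nodes available in each hole.
for h, cap in holes.items():
    prob += pulp.lpSum(x[h, j] for j in jobs) <= cap

# Scaling bounds: each running job gets between min_nodes and max_nodes.
for j, spec in jobs.items():
    total = pulp.lpSum(x[h, j] for h in holes)
    prob += total >= spec["min_nodes"] * y[j]
    prob += total <= spec["max_nodes"] * y[j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (h, j), var in x.items():
    if var.value():
        print(f"{j}: {int(var.value())} nodes from {h}")
```

In the full system this optimization would be re-solved whenever the set of idle holes changes, which is why efficient run-time solvability of the MILP matters.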
