Understanding and optimizing packed neural network training for hyper-parameter tuning

As neural networks are increasingly employed in machine learning practice, organizations will have to determine how to share limited training resources among a diverse set of model training tasks. This paper studies jointly training multiple neural network models on a single GPU. We present an empirical study of this operation, called pack, along with end-to-end experiments that suggest significant improvements for hyperparameter search systems. Our research prototype is implemented in TensorFlow, and we evaluate performance across different models (ResNet, MobileNet, DenseNet, and MLP) and training scenarios. The results suggest: (1) packing two models can bring up to a 40% performance improvement over unpacked setups for a single training step, and the improvement increases when packing more models; (2) the benefit of a pack primitive largely depends on a number of factors, including memory capacity, chip architecture, neural network structure, and batch size; (3) there exists a trade-off between packing and unpacking when training multiple neural network models on limited resources; (4) a pack-based Hyperband is up to 2.7x faster than the original Hyperband in our experimental setting, and this improvement grows as memory size increases and, with it, the number of models that can be packed.
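To make the pack operation concrete, the following is a minimal sketch (not the paper's implementation) of how two models might share a GPU within a single fused training step in TensorFlow 2. The model architectures, optimizers, batch size, and the shared input batch are illustrative assumptions.

```python
import tensorflow as tf

def make_mlp(width):
    # Hypothetical small MLP used only to illustrate packing.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

model_a, model_b = make_mlp(256), make_mlp(512)
opt_a = tf.keras.optimizers.SGD(1e-2)
opt_b = tf.keras.optimizers.SGD(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def packed_step(x, y):
    # Both models run inside one compiled graph launch, so they share the
    # GPU within a single training step rather than across separate jobs.
    with tf.GradientTape(persistent=True) as tape:
        loss_a = loss_fn(y, model_a(x, training=True))
        loss_b = loss_fn(y, model_b(x, training=True))
    opt_a.apply_gradients(zip(tape.gradient(loss_a, model_a.trainable_variables),
                              model_a.trainable_variables))
    opt_b.apply_gradients(zip(tape.gradient(loss_b, model_b.trainable_variables),
                              model_b.trainable_variables))
    del tape  # persistent tape must be released explicitly
    return loss_a, loss_b

# Example: one packed step on a shared random batch.
x = tf.random.normal([64, 784])
y = tf.random.uniform([64], maxval=10, dtype=tf.int32)
packed_step(x, y)
```

In a hyperparameter-search setting such as Hyperband, each packed model would typically receive its own hyperparameters (here, only the learning rates and layer widths differ), while the input pipeline and GPU are shared.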
