Gandiva: Introspective Cluster Scheduling for Deep Learning

We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve latency and efficiency of training deep learning models in a GPU cluster. One key characteristic of deep learning is feedback-driven exploration, where a user often runs a set of jobs (or a multi-job) to achieve the best result for a specific mission and uses early feedback on accuracy to dynamically prioritize or kill a subset of jobs; simultaneous early feedback on the entire multi-job is critical. A second characteristic is the heterogeneity of deep learning jobs in terms of resource usage, making it hard to achieve a best-fit placement a priori. Gandiva addresses these two challenges by exploiting a third key characteristic of deep learning: intra-job predictability, since training jobs perform numerous repetitive mini-batch iterations. Gandiva exploits this intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low latency. The same predictability is also used for introspecting job performance and dynamically migrating jobs to better-fit GPUs, thereby improving cluster efficiency. We show via a prototype implementation and micro-benchmarks that Gandiva can speed up hyper-parameter searches during deep learning by up to an order of magnitude, and can achieve better utilization by transparently migrating and time-slicing jobs to improve job-to-resource fit. We also show that, in a real workload of jobs running in a 180-GPU cluster, Gandiva improves aggregate cluster utilization by 26%, pointing to a new way of managing large GPU clusters for deep learning.
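To make the introspection idea concrete, the sketch below shows one way measured mini-batch times could drive a migration decision of the kind the abstract describes. This is an illustrative Python sketch, not code from the paper: the names Job, steady_state_time, should_migrate, MIGRATION_GAIN, and WINDOW are all hypothetical, and the thresholds are assumptions.

```python
# Illustrative sketch (not Gandiva's implementation): because mini-batch
# iterations are highly repetitive, a scheduler can estimate a job's
# steady-state mini-batch time and compare it against a profiled time on a
# candidate GPU, migrating only when the gain is clear.

import statistics
from dataclasses import dataclass, field
from typing import List

MIGRATION_GAIN = 0.10   # assumed threshold: migrate only if >=10% faster elsewhere
WINDOW = 20             # recent mini-batches used to estimate steady-state time


@dataclass
class Job:
    name: str
    gpu: int
    minibatch_times: List[float] = field(default_factory=list)

    def record(self, seconds: float) -> None:
        """Record the duration of one mini-batch iteration."""
        self.minibatch_times.append(seconds)

    def steady_state_time(self) -> float:
        """Median of recent mini-batch times; robust to warm-up spikes."""
        recent = self.minibatch_times[-WINDOW:]
        return statistics.median(recent) if recent else float("inf")


def should_migrate(current: float, candidate: float) -> bool:
    """Migrate only when the candidate placement is clearly better."""
    return candidate < current * (1.0 - MIGRATION_GAIN)


if __name__ == "__main__":
    # Simulated measurements: the job runs slower on its current GPU
    # than on a profiled candidate GPU.
    job = Job(name="resnet50-trial-7", gpu=0)
    for _ in range(WINDOW):
        job.record(0.30)          # ~300 ms per mini-batch on the current GPU
    trial = [0.22] * WINDOW       # profiled mini-batch times on a candidate GPU

    current_t = job.steady_state_time()
    candidate_t = statistics.median(trial)
    if should_migrate(current_t, candidate_t):
        print(f"migrate {job.name}: {current_t:.2f}s -> {candidate_t:.2f}s per mini-batch")
    else:
        print(f"keep {job.name} on GPU {job.gpu}")
```

The same steady-state estimate could, in principle, also mark mini-batch boundaries as cheap points at which to suspend and resume a job when time-slicing a GPU, since GPU state is smallest between iterations; that policy is omitted here for brevity.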
