Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications

With widespread advances in machine learning, large enterprises are increasingly incorporating machine learning models across many of their products. These models are typically trained on shared, multi-tenant GPU clusters. As with existing cluster computing workloads, scheduling frameworks aim to provide high efficiency, resource isolation, and fair sharing across users. However, Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big-data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, which reduces scheduling flexibility and makes jobs inelastic to failures at runtime. In this paper, we present a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster in a large enterprise. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.
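To make the interplay between gang scheduling and locality concrete, the following Python snippet is a minimal, illustrative sketch of an admission check, not the scheduler studied in the paper; the `Job`, `try_place`, and `relax_locality` names are hypothetical. It shows why a multi-GPU job either waits for a locality-preserving placement or starts only once its entire gang fits somewhere in the cluster.

```python
# Toy gang-scheduling admission check with a locality preference.
# Hypothetical names; not the production scheduler described in the paper.

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    gpus_needed: int  # gang size: all GPUs must be allocated together


def try_place(job: Job, free_gpus: Dict[str, int],
              relax_locality: bool) -> Optional[List[str]]:
    """Return the servers chosen for `job`, or None if it must keep waiting."""
    # Preferred: pack the whole gang onto one server (intra-server locality).
    for server, free in free_gpus.items():
        if free >= job.gpus_needed:
            return [server]

    if not relax_locality:
        return None  # keep queuing rather than fragment the job

    # Relaxed locality: spread across servers, but only if the *entire* gang
    # fits right now; a partial allocation is useless because DNN training
    # cannot start until all of its workers are running (gang scheduling).
    chosen, remaining = [], job.gpus_needed
    for server, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        if remaining <= 0:
            break
        take = min(free, remaining)
        if take > 0:
            chosen.append(server)
            remaining -= take
    return chosen if remaining <= 0 else None


if __name__ == "__main__":
    cluster = {"srv0": 3, "srv1": 3, "srv2": 2}  # free GPUs per server
    job = Job("resnet50-8gpu", gpus_needed=8)

    print(try_place(job, cluster, relax_locality=False))  # None: job keeps queuing
    print(try_place(job, cluster, relax_locality=True))   # ['srv0', 'srv1', 'srv2']
```

The toy example mirrors the trade-off the paper quantifies: insisting on locality keeps the job queued even though enough GPUs are free in aggregate, while relaxing locality starts the job sooner at the cost of a fragmented, potentially slower placement.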
