Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multi-user objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity. In this paper, we propose Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems, making it easy to optimize for objectives in a heterogeneity-aware way, while also being cognizant of performance optimizations like space sharing. Gavel then uses a round-based scheduling mechanism to ensure jobs receive their ideal allocation given the target scheduling policy. Gavel's heterogeneity-aware policies allow a heterogeneous cluster to sustain higher input load, and improve end objectives such as average job completion time and makespan by up to 3.5x compared to heterogeneity-agnostic policies.

[1]  T. Ibaraki,et al.  THE MULTIPLE-CHOICE KNAPSACK PROBLEM , 1978 .

[2]  Dimitri P. Bertsekas,et al.  Data Networks , 1986 .

[3]  Jean-Yves Le Boudec,et al.  A Unified Framework for Max-Min and Min-Max Fairness With Applications , 2007, IEEE/ACM Transactions on Networking.

[4]  Ruslan Salakhutdinov,et al.  Probabilistic Matrix Factorization , 2007, NIPS.

[5]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[6]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[7]  Emmanuel J. Candès,et al.  Matrix Completion With Noise , 2009, Proceedings of the IEEE.

[8]  Benjamin Hindman,et al.  Dominant Resource Fairness: Fair Allocation of Multiple Resource Types , 2011, NSDI.

[9]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[10]  Michael Abd-El-Malek,et al.  Omega: flexible, scalable schedulers for large compute clusters , 2013, EuroSys '13.

[11]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[12]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[13]  Wei Lin,et al.  Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing , 2014, OSDI.

[14]  F. Maxwell Harper,et al.  The MovieLens Datasets: History and Context , 2016, TIIS.

[15]  Abhishek Verma,et al.  Large-scale cluster management at Google with Borg , 2015, EuroSys.

[16]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[18]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[19]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[20]  Mor Harchol-Balter,et al.  TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters , 2016, EuroSys.

[21]  Khalil Sima'an,et al.  Multi30K: Multilingual English-German Image Descriptions , 2016, VL@ACL.

[22]  Stephen P. Boyd,et al.  CVXPY: A Python-Embedded Modeling Language for Convex Optimization , 2016, J. Mach. Learn. Res..

[23]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[24]  David A. Patterson,et al.  In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[25]  Alexander Wong,et al.  Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video , 2017, ArXiv.

[26]  拓海 杉山,et al.  “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告 , 2017 .

[27]  Xiaobo Zhou,et al.  Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization , 2017, USENIX Annual Technical Conference.

[28]  Richard Socher,et al.  Pointer Sentinel Mixture Models , 2016, ICLR.

[29]  Wencong Xiao,et al.  Gandiva: Introspective Cluster Scheduling for Deep Learning , 2018, OSDI.

[30]  Amar Phanishayee,et al.  Accelerating Deep Learning Workloads Through Efficient Multi-Model Execution , 2018 .

[31]  Eric S. Chung,et al.  A Configurable Cloud-Scale DNN Processor for Real-Time AI , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[32]  Abdallah Moussawi Towards Large Scale Training Of Autoencoders For Collaborative Filtering , 2018, ArXiv.

[33]  Carlo Curino,et al.  Hydra: a federated resource manager for data-center scale analytics , 2019, NSDI.

[34]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[35]  Nikhil R. Devanur,et al.  PipeDream: generalized pipeline parallelism for DNN training , 2019, SOSP.

[36]  Wencong Xiao,et al.  Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads , 2019, USENIX Annual Technical Conference.

[37]  Kang G. Shin,et al.  Tiresias: A GPU Cluster Manager for Distributed Deep Learning , 2019, NSDI.

[38]  Xiao Sun,et al.  Fair Allocation of Heterogeneous and InterchangeableResources , 2019, PERV.

[39]  Hongzi Mao,et al.  Learning scheduling algorithms for data processing clusters , 2018, SIGCOMM.

[40]  Kunle Olukotun,et al.  Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark , 2018, ACM SIGOPS Oper. Syst. Rev..

[41]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[42]  Nipun Kwatra,et al.  Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning , 2020, EuroSys.

[43]  Tan N. Le,et al.  AlloX: compute allocation in hybrid clusters , 2020, EuroSys.

[44]  S. Venkataraman,et al.  Themis: Fair and Efficient GPU Cluster Scheduling , 2019, NSDI.