Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, substantial contention ensues when multiple such workloads run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide an instantaneous fair share of resources are a poor fit because of ML workloads' unique attributes: ML jobs are typically long-running, have coarse-grained tasks that must be gang-scheduled, and their performance is sensitive to tasks' relative placement. These properties cannot be captured by existing fair-sharing schemes. We propose Themis, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture in which ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off efficiency for fairness in the short term, while compensating for finish-time fairness in the long term. Our evaluation on a number of machine learning models shows that Themis can ensure greater fairness while providing more efficient allocations than state-of-the-art schedulers.
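To make the finish-time fairness notion concrete, the following is a minimal Python sketch of the metric and of how a central arbiter might select which workloads get to bid on freed GPUs. The metric rho compares a job's projected finish time under its current allocation to its projected finish time with an exclusive 1/N cluster share. The linear-scaling throughput estimate, the Job fields, and the fairness_knob parameter are illustrative assumptions for this sketch, not the paper's exact interfaces.

```python
from dataclasses import dataclass

# Hypothetical job model; field names are illustrative, not Themis's actual API.
@dataclass
class Job:
    name: str
    remaining_gpu_hours: float  # remaining work, normalized to a single GPU
    allocated_gpus: float       # GPUs currently held by this job

def finish_time_fairness(job: Job, cluster_gpus: int, num_jobs: int) -> float:
    """Estimate rho = T_shared / T_independent for one job.

    T_shared: projected finish time under the job's current allocation.
    T_independent: projected finish time with an exclusive 1/N cluster share.
    rho > 1 means the job is worse off than under equal sharing; the
    allocation policy aims to equalize (and shrink the maximum) rho.
    Assumes throughput scales linearly with GPUs, which real ML jobs
    only approximate.
    """
    fair_share = cluster_gpus / num_jobs
    t_shared = job.remaining_gpu_hours / max(job.allocated_gpus, 1e-9)
    t_independent = job.remaining_gpu_hours / fair_share
    return t_shared / t_independent

def pick_auction_participants(jobs: list[Job], cluster_gpus: int,
                              fairness_knob: float = 0.8) -> list[Job]:
    """Offer freed GPUs to the fraction of jobs farthest from fair (highest rho).

    A larger fairness_knob restricts bidding to the most-starved jobs
    (favoring fairness); a smaller one widens the bidder pool, giving the
    arbiter more placement choices (favoring efficiency).
    """
    ranked = sorted(jobs,
                    key=lambda j: finish_time_fairness(j, cluster_gpus, len(jobs)),
                    reverse=True)
    cutoff = max(1, int(round(fairness_knob * len(ranked))))
    return ranked[:cutoff]
```

In this reading, the arbiter periodically auctions freed GPUs among the selected bidders; short-term, efficiency-driven wins are compensated over time because jobs that lose auctions see their rho grow and are prioritized in later rounds.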
