Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters

With the recent widespread adoption of deep learning (DL) in academia and industry, DL platforms that support the research and development (R&D) of AI firms, institutes, and universities have attracted increasing attention. For an off-the-shelf distributed GPU cluster, prior work proposes prediction-based schedulers to allocate resources to diverse DL workloads. However, prediction-based schedulers suffer from limited prediction accuracy and high offline-profiling costs. In this paper, we propose a learning-based scheduler that models the scheduling problem as a reinforcement learning problem, aiming to minimize average job completion time and maximize system utilization. The scheduler comprises the designs of the state space, action space, reward function, and update scheme. Furthermore, we will evaluate our proposed scheduler, implemented as a TensorFlow plugin, on a real cluster and in large-scale simulation.
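To make the RL formulation concrete, the following is a minimal toy sketch of how such a scheduling problem can be cast as a Markov decision process. The state, action, and reward designs here are illustrative assumptions for exposition, not the authors' exact formulation: the state is the free GPU count plus the remaining work of pending jobs, an action selects which pending job receives the next GPU time step, and the reward is a negative time penalty that encourages short job completion times.

```python
import random

class SchedulerEnv:
    """Toy RL environment for DL job scheduling.

    Illustrative sketch only: the state/action/reward designs below are
    assumptions for exposition, not the scheduler proposed in the paper.
    """

    def __init__(self, num_gpus=4, jobs=None):
        self.num_gpus = num_gpus
        # Each pending job: (remaining work in time steps, GPUs requested).
        self.jobs = jobs or [(random.randint(1, 5), 1) for _ in range(6)]
        self.time = 0
        self.completed = []  # completion times of finished jobs

    def state(self):
        # State: free GPU count plus remaining work of each pending job.
        return (self.num_gpus, tuple(work for work, _ in self.jobs))

    def step(self, action):
        # Action: index of the pending job that runs for one time step.
        work, req = self.jobs[action]
        self.time += 1
        work -= 1
        if work == 0:
            self.jobs.pop(action)
            self.completed.append(self.time)
        else:
            self.jobs[action] = (work, req)
        # Reward: constant time penalty, so cumulative reward is maximized
        # by finishing all jobs (and hence each job) as early as possible.
        reward = -1.0
        done = not self.jobs
        return self.state(), reward, done
```

A learned policy would map `state()` to an action; as a stand-in, a shortest-remaining-work heuristic can drive the environment: repeatedly pick `min(jobs)` by remaining work, call `step`, and average `env.completed` to obtain the mean job completion time that the RL objective targets.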
