A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

Public cloud GPU clusters are emerging as platforms for training distributed deep learning jobs. Under this training paradigm, the job scheduler is a crucial component for improving the user experience, i.e., reducing training fees and job completion time, which can also save power costs for service providers. However, the scheduling problem is known to be NP-hard. Most existing work divides it into two easier sub-tasks, an ordering task and a placement task, which decide the order in which jobs are scheduled and the order in which GPU machines are selected for placement, respectively. Owing to their superior adaptability, learning-based policies generally perform better than traditional heuristic-based methods. Nevertheless, two main challenges have not been well solved. First, most learning-based methods focus on either the ordering policy or the placement policy in isolation and ignore the cooperation between them. Second, unbalanced machine performance and resource contention impose large overhead and uncertainty on job duration, yet they are rarely considered in existing work. To tackle these issues, this paper presents a dual-agent scheduler framework, abstracted from the two sub-tasks, that jointly learns the ordering and placement policies and makes better-informed scheduling decisions. Specifically, we design an ordering agent with a scalable squeeze-and-communicate strategy for better cooperation; for the placement agent, we propose a novel Random Walk Gaussian Process that learns the performance similarities of GPU machines while remaining aware of uncertain performance fluctuations. Finally, the two agents are jointly optimized with multi-agent reinforcement learning. Extensive experiments conducted on a real-world production cluster trace demonstrate the superiority of our model.
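To make the ordering/placement decomposition concrete, the following minimal Python sketch splits scheduling into the two sub-tasks using plain heuristics. It is not the paper's dual-agent method: shortest-job-first and first-fit are stand-ins for the learned policies, and every name here (`Job`, `Machine`, `order_jobs`, `place_job`, `schedule`, and the fields `gpus`, `est_hours`, `free_gpus`) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int          # GPUs requested (hypothetical field)
    est_hours: float   # estimated duration (hypothetical field)


@dataclass
class Machine:
    name: str
    free_gpus: int     # GPUs currently unallocated (hypothetical field)


def order_jobs(jobs):
    """Ordering sub-task: shortest-job-first stands in for a learned ordering policy."""
    return sorted(jobs, key=lambda job: job.est_hours)


def place_job(job, machines):
    """Placement sub-task: first-fit stands in for a learned placement policy."""
    for machine in machines:
        if machine.free_gpus >= job.gpus:
            machine.free_gpus -= job.gpus  # reserve the GPUs on this machine
            return machine.name
    return None  # no machine has enough free GPUs; the job has to wait


def schedule(jobs, machines):
    """Run the two sub-tasks in sequence: order the jobs, then place each one."""
    return {job.name: place_job(job, machines) for job in order_jobs(jobs)}


jobs = [Job("a", 4, 10.0), Job("b", 2, 1.0), Job("c", 4, 3.0)]
machines = [Machine("m0", 4), Machine("m1", 4)]
result = schedule(jobs, machines)
```

In this sketch the two stages never exchange information, which is the kind of independence between ordering and placement that the abstract identifies as a limitation of existing approaches.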
