Paper Information - GPU Scheduling for Short Tasks in Private Cloud

GPU Scheduling for Short Tasks in Private Cloud

GPUs are expensive and often unaffordable for individuals, so sharing them among a group of users is necessary to lower cost and avoid idle hardware. Unlike jobs in production environments, which often last for days or weeks, programs in development and testing environments tend to run for much shorter times. Assigning a dedicated GPU to each person for development leaves that GPU idle much of the time. For economic reasons, researchers therefore usually share a small number of GPUs for development, especially in small teams or labs. Users want to lease and release GPUs automatically and to get job responses as soon as possible. Current GPU-sharing approaches either lack good support for multiple users or are not designed to work effectively in such cases. This paper proposes a GPU-sharing method among multiple users for short GPU tasks. We implement a container-based batch computing system that accepts and executes users' jobs through container images and specified configurations. A shortest-job-first-based scheduling policy gives priority to short tasks while preventing long tasks from starving. Evaluation demonstrates that the proposed method is effective and that the system has low overhead.
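The abstract names a shortest-job-first-based policy that also keeps long tasks from starving, but it does not say how starvation is avoided. The sketch below illustrates one common way to combine the two goals, namely aging: a job's estimated runtime is discounted by the time it has already waited. This is an assumption for illustration, not the paper's actual policy; the `Job` fields, `effective_cost`, `pick_next`, and the `aging_rate` parameter are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    est_runtime: float   # estimated running time in seconds (hypothetical field)
    submitted_at: float  # submission time in seconds (hypothetical field)


def effective_cost(job: Job, now: float, aging_rate: float) -> float:
    """Estimated runtime discounted by how long the job has waited.

    Assumption: linear aging. The longer a job waits, the lower its cost,
    so a long job eventually beats newly submitted short jobs.
    """
    return job.est_runtime - aging_rate * (now - job.submitted_at)


def pick_next(queue: List[Job], now: float, aging_rate: float = 0.5) -> Job:
    """Remove and return the queued job with the lowest effective cost.

    Ties are broken in favor of the job submitted earlier.
    """
    best = min(
        queue,
        key=lambda j: (effective_cost(j, now, aging_rate), j.submitted_at),
    )
    queue.remove(best)
    return best
```

With `aging_rate=0`, `pick_next` reduces to plain shortest-job-first. A larger value lets waiting jobs overtake short newcomers sooner, which trades some short-task latency for a bound on how long a long task can wait.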

Bo An | Yan Li | Donggang Cao | Jialun Shao | Junming Ma
