Scheduling CPU for GPU-based Deep Learning Jobs
Deep learning (DL) is a popular and important artificial-intelligence workload in data centers. With recent breakthroughs enabled by graphics accelerators and the popularity of DL frameworks, GPU server clusters dominate DL training in current practice. Cluster schedulers simply treat DL jobs as black boxes and allocate GPUs according to the job request specified by the user. Other resources, e.g., CPU, are often allocated with workload-agnostic approaches: Kubeflow [1] performs heuristic static CPU assignment based on task types (e.g., worker, parameter server), while [2] evenly divides the CPUs of a server among its GPUs. Despite the traditional impression that the GPU is what matters in DL, our observations suggest that the importance of the CPU is undervalued. Identifying an appropriate number of CPU cores in a heterogeneous cluster is challenging yet performance-critical for DL jobs. The diversity of CPU usage is not well recognized in the following three aspects.

Heterogeneous CPU demand across jobs. Although powered by GPU accelerators, different workloads exhibit a tremendous gap in CPU demand. Figure 1a shows the number of CPU cores required to reach maximal training performance for different workloads: Overfeat and Resnet50 require 40 and 7 cores on a V100, respectively. Moreover, workloads are not equally sensitive to insufficient resources. The training speed of Overfeat drops 45% when going from 14 cores to 7 cores, whereas Resnet50 drops only 20%. Therefore, when resources are scarce, sacrificing the performance of Resnet50 is more cost-effective than sacrificing that of Overfeat.

Better GPU, more CPU. Another key insight from Figure 1a is that the better the GPU allocated, the more CPUs are required. Moreover, with a better GPU, Overfeat requires many more CPUs than Resnet50 to maximize performance, showing a different sensitivity. DL frameworks (e.g., TensorFlow) overlap computation on the CPU (e.g., data pre-processing) with computation on the GPU (e.g., convolution) to maximize resource utilization. With a better GPU, the latency of GPU operators shrinks, which makes the latency of CPU operations relatively more prominent and calls for more CPU cores. Furthermore, in contrast to the slowdown of CPU scaling, hardware accelerators (e.g., GPUs) are developing fast, which advocates careful assignment of CPU resources for coordinated execution.

Waving demand over time. DL training is a feedback-driven exploration that periodically switches between training and model validation. For some sequence-to-sequence models, such as text summarization, validating the generated output of the trained model requires computation that differs from training. Figure 1b shows the profiling of a neural machine translation (NMT) task with 1 GPU and 4 CPU cores allocated: CPU and GPU utilization vary cyclically. During training, 4 cores are sufficient, as the average CPU utilization is only 104%. During validation, however, GPU utilization is only 8% while CPU utilization reaches 387%; the latency is CPU-bound. Increasing the allocation to 24 CPU cores reduces validation time by 75%.

To address these CPU scheduling challenges, we present SAD, which maximizes cluster throughput through coarse-grained periodic rescheduling driven by a performance predictor based on optimal experiment design. SAD is adaptive and characteristic-aware, automatically inferring the appropriate CPU resources to allocate to each job.
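The paper does not spell out the form of SAD's performance predictor; the following is a minimal sketch, assuming a saturating throughput-versus-cores curve fitted from a few lightweight profiling samples. The function names (saturating_throughput, predict_speed) and the sample numbers are illustrative, not taken from the paper.

```python
# Hypothetical sketch (not SAD's actual predictor): model per-job training
# throughput as a saturating function of CPU cores, fitted from a handful of
# profiled points, so a scheduler can ask "what speed would this job reach
# with c cores on this GPU?" without exhaustive profiling.
import numpy as np
from scipy.optimize import curve_fit

def saturating_throughput(cores, t_max, half_cores):
    # Diminishing-returns curve: throughput approaches t_max as the CPU-side
    # pipeline stops being the bottleneck; half_cores is the core count at
    # which half of t_max is reached. Both are fitted parameters.
    return t_max * cores / (half_cores + cores)

# Example profiled samples (cores, images/sec) for one job on one GPU type.
# These numbers are made up for illustration, not measurements from the paper.
profiled_cores = np.array([2, 4, 8, 16], dtype=float)
profiled_speed = np.array([110.0, 190.0, 280.0, 330.0])

params, _ = curve_fit(saturating_throughput, profiled_cores, profiled_speed,
                      p0=[400.0, 4.0])

def predict_speed(cores):
    """Predicted training speed (images/sec) for a given CPU core count."""
    return saturating_throughput(cores, *params)

if __name__ == "__main__":
    for c in (7, 14, 40):
        print(f"{c:2d} cores -> {predict_speed(c):.0f} images/sec (predicted)")
```

Such a fitted curve captures the observation above that additional cores help a CPU-hungry job like Overfeat far longer than a job like Resnet50 that saturates early.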
Through lightweight profiling and continual monitoring, SAD captures the inter-job and intra-job resource-demand heterogeneity of DL. Its performance predictor accurately estimates the training speed of DL jobs for different CPU core counts across various GPUs. Our preliminary result on a small trace shows that SAD improves overall utilization by 19% and reduces job completion time by 34% compared with workload-agnostic allocation.
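To illustrate how per-job predictions could feed a coarse-grained periodic rescheduler, here is a hedged sketch, not SAD's actual algorithm: at every rescheduling interval the CPU cores of a server are handed out greedily to the job with the largest predicted relative gain from one more core. The helper name reschedule_cpus and the toy predictors are assumptions for the example.

```python
# Hypothetical sketch of coarse-grained periodic rescheduling: redistribute a
# server's CPU cores among its running jobs greedily, giving the next core to
# the job with the largest predicted relative throughput gain, so that a 20%
# loss for a less sensitive job can be traded against a 45% loss elsewhere.
import heapq

def reschedule_cpus(predictors, total_cores, min_cores=1):
    """predictors: {job: f(cores) -> predicted speed}. Returns {job: cores}."""
    alloc = {job: min_cores for job in predictors}
    spare = total_cores - min_cores * len(predictors)

    def gain(job):
        # Relative speedup from granting this job one more core.
        cur = predictors[job](alloc[job])
        nxt = predictors[job](alloc[job] + 1)
        return (nxt - cur) / max(cur, 1e-9)

    # Max-heap (negated gains) over "give one more core" candidate moves.
    heap = [(-gain(job), job) for job in predictors]
    heapq.heapify(heap)
    while spare > 0 and heap:
        _, job = heapq.heappop(heap)
        alloc[job] += 1
        spare -= 1
        heapq.heappush(heap, (-gain(job), job))
    return alloc

# Example with two toy predictors whose shapes mimic a CPU-hungry job and a
# job that saturates early (illustrative only).
if __name__ == "__main__":
    preds = {
        "overfeat": lambda c: 500 * c / (20 + c),
        "resnet50": lambda c: 300 * c / (3 + c),
    }
    print(reschedule_cpus(preds, total_cores=48))
```

A greedy marginal-gain allocation like this is one simple way to trade CPU cores across heterogeneous jobs; the actual SAD scheduler may weigh jobs and phases (training versus validation) differently.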