Ai BCS: A GPU cluster scheduling optimization based on the SKE model

General-purpose computing on GPUs (GPGPU) is an important technology and a research hotspot in cloud computing, and both its energy consumption and its performance deserve close attention. In this paper we build SKE (Single Kernel Estimate), a static performance analysis model that estimates the completion time of a kernel function on a GPU. SKE is more accurate than comparable models: it helps choose the best parallelization of a program and the granularity of the thread-block division for a specific GPU device, so that the kernel executes as fast as possible, and it lets a GPU cluster scheduler balance work against the current load while reducing sub-task migration. The deviation between the completion time estimated by SKE and the measured kernel execution time is no more than 13%. On this basis we compute the completion time of each GPU sub-task, identify the critical path of the cluster workload, and propose a GPU cluster scheduling algorithm, BCS (Based on Critical-path Scheduling). BCS lowers the frequency of non-critical nodes, mainly through dynamic voltage and frequency scaling (DVFS), and thereby reduces the energy consumption of GPU nodes without extending the final completion time of the cluster. Evaluation results show that BCS reduces energy consumption by up to 9.4% compared to DRS.
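To make the scheduling idea concrete, below is a minimal sketch of a critical-path, slack-driven DVFS assignment of the kind the abstract describes: sub-task completion times (as an SKE-like model would supply) are propagated through the task DAG, tasks on the critical path keep the nominal frequency, and tasks with slack are slowed to the lowest frequency that still fits their slack. All names, the function signature, and the linear time-frequency scaling assumption are illustrative and not taken from the paper.

```python
# Hypothetical sketch of a BCS-style slack schedule; not the paper's implementation.
from collections import defaultdict

def bcs_slack_schedule(tasks, deps, f_max, f_levels):
    """tasks: {task_id: estimated completion time at f_max (e.g., from SKE)}
       deps:  list of (predecessor, successor) edges of the sub-task DAG
       f_max: nominal GPU core frequency
       f_levels: available DVFS frequencies, ascending"""
    succs, preds = defaultdict(list), defaultdict(list)
    for u, v in deps:
        succs[u].append(v)
        preds[v].append(u)

    # Forward pass: earliest start times and the cluster makespan.
    order = _topo_order(tasks, succs, preds)
    est = {}
    for t in order:
        est[t] = max((est[p] + tasks[p] for p in preds[t]), default=0.0)
    makespan = max(est[t] + tasks[t] for t in tasks)

    # Backward pass: latest start times; slack = latest start - earliest start.
    lst = {}
    for t in reversed(order):
        lst[t] = min((lst[s] for s in succs[t]), default=makespan) - tasks[t]

    freq = {}
    for t in tasks:
        slack = lst[t] - est[t]
        if slack <= 0:              # on the critical path: keep full speed
            freq[t] = f_max
            continue
        # Assume execution time scales roughly as f_max / f (a common DVFS
        # approximation); pick the lowest frequency that still fits the slack.
        budget = tasks[t] + slack
        feasible = [f for f in f_levels if tasks[t] * f_max / f <= budget]
        freq[t] = min(feasible) if feasible else f_max
    return freq, makespan

def _topo_order(tasks, succs, preds):
    # Kahn's algorithm over the sub-task DAG.
    indeg = {t: len(preds[t]) for t in tasks}
    ready = [t for t in tasks if indeg[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order
```

Under this sketch, only non-critical tasks are slowed, so the makespan computed in the forward pass is unchanged, which mirrors the goal stated above of saving energy without affecting the cluster's final completion time.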
