Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU

As one of the most popular accelerators, Graphics Processing Unit (GPU) has demonstrated high computing power in several application fields. On the other hand, GPU also produces high power consumption and has been one of the most largest power consumers in desktop and supercomputer systems. However, software power optimization method targeted for GPU has not been well studied. In this work, we propose kernel fusion method to reduce energy consumption and improve power efficiency on GPU architecture. Through fusing two or more independent kernels, kernel fusion method achieves higher utilization and much more balanced demand for hardware resources, which provides much more potential for power optimization, such as dynamic voltage and frequency scaling (DVFS). Basing on the CUDA programming model, this paper also gives several different fusion methods targeted for different situations. In order to make judicious fusion strategy, we deduce the process of fusing multiple independent kernels as a dynamic programming problem, which could be well solved with many existing tools and be simply embedded into compiler or runtime system. To reduce the overhead introduced by kernel fusion, we also propose effective method to reduce the usage of shared memory and coordinate the thread space of the kernels to be fused. Detailed experimental evaluation validates that the proposed kernel fusion method could reduce energy consumption without performance loss for several typical kernels.

[1]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[2]  Chen Ding,et al.  The energy impact of aggressive loop fusion , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[3]  Uday Bondhugula,et al.  Effective automatic parallelization of stencil computations , 2007, PLDI '07.

[4]  Kevin Skadron,et al.  A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..

[5]  Dimitrios S. Nikolopoulos,et al.  Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes , 2008, IEEE Transactions on Parallel and Distributed Systems.

[6]  Keqin Li,et al.  Performance Analysis of Power-Aware Task Scheduling Algorithms on Multiprocessor Computers with Dynamic Voltage and Speed , 2008, IEEE Transactions on Parallel and Distributed Systems.

[7]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Wen-mei W. Hwu,et al.  Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.

[9]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.

[10]  Xuejun Yang,et al.  Energy-Oriented OpenMP Parallel Loop Scheduling , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications.

[11]  Wen-mei W. Hwu,et al.  Program optimization carving for GPU computing , 2008, J. Parallel Distributed Comput..

[12]  Uday Bondhugula,et al.  A compiler framework for optimization of affine loop nests for gpgpus , 2008, ICS '08.

[13]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[14]  Song Huang,et al.  On the energy efficiency of graphics processing units for scientific computing , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[15]  Tao Tang,et al.  Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+ , 2009, 2009 15th International Conference on Parallel and Distributed Systems.

[16]  Kevin Skadron,et al.  Enabling Task Parallelism in the CUDA Scheduler , 2009 .

[17]  Kai Ma,et al.  Temperature-constrained power control for chip multiprocessors with online model estimation , 2009, ISCA '09.

[18]  Andrew B. Kahng,et al.  ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[19]  Arnaud Tisserand,et al.  Power Consumption of GPUs from a Software Perspective , 2009, ICCS.

[20]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[21]  Yi Yang,et al.  A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.