Scalable Kernel Fusion for Memory-Bound GPU Applications

GPU implementations of HPC applications that rely on finite difference methods can include tens of memory-bound kernels. Kernel fusion can improve performance by reducing data traffic to off-chip memory: kernels that share data arrays are fused into larger kernels, where on-chip cache holds the data reused by instructions originating from different kernels. The main challenges are (a) searching for the optimal kernel fusions under the constraints of data dependencies and kernel precedences, and (b) applying kernel fusion effectively to achieve speedup. This paper introduces a problem definition and proposes a scalable method for searching the space of possible kernel fusions to identify optimal kernel fusions for large problems. The paper also proposes a codeless performance upper-bound projection model to achieve effective fusions. Results show that the proposed scalable method for kernel fusion improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.
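To make the traffic argument concrete, the following is a minimal sketch of how fusing kernels that share arrays could reduce off-chip transfers. It is not the paper's projection model. The `Kernel` type, the two function names, and the example kernels are hypothetical, and the sketch assumes that all arrays have equal size, that a fused kernel loads and stores each distinct array at most once, and that runtime is proportional to off-chip traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical kernel description: the arrays it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def unfused_traffic(kernels):
    """Array transfers when every kernel runs as a separate launch."""
    return sum(len(k.reads) + len(k.writes) for k in kernels)


def fused_traffic(kernels):
    """Array transfers when the kernels, in the given order, run as one kernel.

    Assumption: an array written earlier in the fused kernel is reused
    on-chip, so later reads of it cause no extra off-chip load.
    """
    loaded, stored = set(), set()
    for k in kernels:
        loaded |= k.reads - stored   # only arrays not yet produced on-chip
        stored |= k.writes
    return len(loaded) + len(stored)


# Two illustrative kernels: k2 consumes the array "c" that k1 produces,
# and both read the array "b". The order respects the precedence k1 -> k2.
k1 = Kernel("k1", frozenset({"a", "b"}), frozenset({"c"}))
k2 = Kernel("k2", frozenset({"b", "c"}), frozenset({"d"}))

before = unfused_traffic([k1, k2])   # 3 + 3 transfers
after = fused_traffic([k1, k2])      # loads a, b; stores c, d
speedup_bound = before / after       # optimistic bound under the assumptions
```

Under these assumptions the ratio is an upper bound only. A real fused kernel must also fit the shared data in on-chip memory and may need halo loads and synchronization for stencil accesses, which this sketch ignores.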
