论文信息 - Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality

Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality

Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today's HPC world. For many applications, however, achieving a high fraction of peak on current GPUs, still requires significant programmer effort. A key consideration for optimizing GPU code is determining a suitable amount of work to be performed by each thread. Thread granularity not only has a direct impact on occupancy but can also influence data locality at the register and shared-memory levels. This paper describes a software framework to analyze dependencies in parallel GPU threads and perform source-level restructuring to obtain GPU kernels with varying thread granularity. The framework supports specification of coarsening factors through source-code annotation and also implements a heuristic based on estimated register pressure that automatically recommends coarsening factors for improved memory performance. We present preliminary experimental results on a select set of CUDA kernels. The results show that the proposed strategy is generally able to select profitable coarsening factors. More importantly, the results demonstrate a clear need for automatic control of thread granularity at the software level for achieving higher performance.

[1] John Cavazos,et al. Optimizing and Auto-tuning Belief Propagation on the GPU , 2010, LCPC.

[2] Richard W. Vuduc,et al. Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[3] Xipeng Shen,et al. A cross-input adaptive framework for GPU program optimizations , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[4] Rudolf Eigenmann,et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[5] Apan Qasem,et al. Exploring the Optimization Space of Dense Linear Algebra Kernels , 2008, LCPC.

[6] João Correia Lopes,et al. High Performance Computing for Computational Science - VECPAR 2010 - 9th International conference, Berkeley, CA, USA, June 22-25, 2010, Revised Selected Papers , 2011, VECPAR.

[7] Keith D. Cooper,et al. Effective partial redundancy elimination , 1994, PLDI '94.

[8] Naga K. Govindaraju,et al. High performance discrete Fourier transforms on graphics processors , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[9] Jack J. Dongarra,et al. A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[10] Allen,et al. Optimizing Compilers for Modern Architectures , 2004 .

[11] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[12] J. Ramanujam,et al. Automatic C-to-CUDA Code Generation for Affine Programs , 2010, CC.

[13] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[14] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[15] Satoshi Matsuoka,et al. Auto-tuning 3-D FFT library for CUDA GPUs , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[16] Patricia J. Teller,et al. Proceedings of the 2008 ACM/IEEE conference on Supercomputing , 2008, HiPC 2008.

[17] Jack J. Dongarra,et al. Accelerating GPU Kernels for Dense Linear Algebra , 2010, VECPAR.

[18] Ken Kennedy,et al. Improving the ratio of memory operations to floating-point operations in loops , 1994, TOPL.

[19] Samuel Williams,et al. Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[20] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.

[21] Richard W. Vuduc,et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs , 2010, PPoPP '10.

[22] Justin P. Haldar,et al. Accelerating iterative field-compensated MR image reconstruction on GPUs , 2010, 2010 IEEE International Symposium on Biomedical Imaging: From Nano to Macro.

[23] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[24] P. Sadayappan,et al. Optimal loop unrolling for GPGPU programs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[25] Ron Cytron,et al. What's In a Name? -or- The Value of Renaming for Parallelism Detection and Storage Allocation , 1987, ICPP.

[26] Hans Werner Meuer,et al. Top500 Supercomputer Sites , 1997 .

[27] N.K. Govindaraju,et al. A Memory Model for Scientific Algorithms on Graphics Processors , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[28] Nathan R. Tallent,et al. HPCTOOLKIT: tools for performance analysis of optimized parallel programs , 2010, Concurr. Comput. Pract. Exp..