Identifying scalar behavior in CUDA kernels

We propose a compiler analysis pass for programs expressed in the Single Program, Multiple Data (SPMD) programming model. It identifies statically several kinds of regular patterns that can occur between adjacent threads, including common computations, memory accesses at consecutive locations or at the same location and uniform control flow. This knowledge can be exploited by SPMD compilers targeting SIMD architectures. We present a compiler pass developed within the Ocelot framework that performs this analysis on NVIDIA CUDA programs at the PTX intermediate language level. Results are compared with optima obtained by simulation of several sets of CUDA benchmarks.

[1]  Pradeep Dubey,et al.  Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.

[2]  Sudhakar Yalamanchili,et al.  Technical Report : GIT-CERCS-09-06 A Characterization and Analysis of GPGPU Kernels , 2009 .

[3]  Mike Murphy,et al.  Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs , 2010, CGO '10.

[4]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[5]  Mendel Rosenblum,et al.  Stream programming on general-purpose processors , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[6]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[7]  Wen-mei W. Hwu,et al.  MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs , 2008, LCPC.

[8]  Gregory Diamos The Design and Implementation Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86 , 2009 .

[9]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[10]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[11]  Fred Weber,et al.  AMD 3DNow! technology: architecture and implementations , 1999, IEEE Micro.

[12]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[13]  John Wawrzynek,et al.  Vector microprocessors , 1998 .

[14]  Paulius Micikevicius,et al.  3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.

[15]  Aaftab Munshi,et al.  The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).

[16]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.

[17]  Yao Zhang,et al.  Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations , 2009, Euro-Par Workshops.

[18]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).