With GPU architectures becoming increasingly important due to their large number of parallel processors, NVIDIA's CUDA environment is widely used to support general-purpose applications. To exploit this parallel processing power, programmers must parallelize and map their algorithms efficiently. The difficulty of this task motivates an investigation of CUDA's compiler. Part of the compiler in the CUDA tool-chain is entirely undocumented, as is its output. To draw conclusions about this compiler's behaviour, the resulting object code is reverse engineered. A visualization tool is introduced that analyzes the previously unknown compiler behaviour and helps the programmer improve the mapping process, with a focus on register allocation and instruction reordering. This paper describes an extension to the CUDA tool-chain that provides programmers with a visualization of register life ranges. The paper also presents guidelines describing how to apply optimizations that lower register pressure. In a case-study example, optimizing the code with the help of the introduced visualization tool increases performance by 33% compared to already optimized CUDA code. In 11 further case-study examples, register pressure is reduced by an average of 18%. The presented guidelines could be added to the compiler so that a similar register pressure reduction is achieved automatically at compile time for new and existing CUDA programs.
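As a minimal, hypothetical CUDA sketch of the kind of source-level reordering such guidelines address (the kernels below are illustrative assumptions, not the paper's actual case studies), consider shortening a value's live range by moving its load closer to its only use:

// Illustrative only: shows how live-range length relates to register pressure.

__global__ void scale_offset_long_range(const float *in, const float *bias,
                                        float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // 'b' is loaded early but used only at the end, so it stays live across
    // the whole loop and occupies a register for that entire span.
    float b = bias[i];

    float acc = 0.0f;
    for (int k = 0; k < 8; ++k)
        acc += in[i] * (float)(k + 1);

    out[i] = acc + b;
}

__global__ void scale_offset_short_range(const float *in, const float *bias,
                                         float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < 8; ++k)
        acc += in[i] * (float)(k + 1);

    // Loading bias[i] immediately before its use keeps its live range short,
    // which can reduce the number of registers needed per thread.
    out[i] = acc + bias[i];
}

The per-thread register counts of the two variants can be compared with nvcc --ptxas-options=-v; whether the reordering actually lowers the count depends on the compiler version and its own instruction scheduling, which is exactly what the visualization of register life ranges is meant to make observable.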