Performance Degradation Analysis of GPU Kernels

Hardware accelerators (currently Graphics Processing Units, or GPUs) are an important component of many existing high-performance computing solutions [5]. Their variety and usage are expected to grow dramatically [1] for several reasons. First, GPUs offer impressive energy efficiency [3]. Second, when properly programmed, they yield impressive speedups by letting programmers model their computation around many fine-grained threads, whose focus can be rapidly switched during memory stalls. Unfortunately, achieving high memory access efficiency requires considerable computational thinking to decompose a problem domain appropriately.

Our work currently addresses the needs of the CUDA [5] approach to programming GPUs, where high performance depends on following certain memory access rules. Two important classes of such rules are bank conflict avoidance rules, which pertain to CUDA shared memory, and coalesced access rules, which pertain to global memory. The former require consecutive threads to generate memory addresses that fall within separate shared memory banks. The latter require threads to generate memory addresses that permit coalesced fetches from global memory. In previous work [6], we addressed the former, to some extent, through SMT-based methods; several other efforts also address bank conflicts [7, 8, 4]. In this work, we address the latter requirement: detecting when coalesced access rules are being violated.