Characterization and Transformation of Unstructured Control Flow in GPU Applications

Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA and OpenCL. Although this technology is widely used, commodity GPUs use different schemes to implement it, and the performance limitations of these different schemes under real workloads are not well understood. This study identifies important classes of program control flows, and characterize their presence in real workloads. It is shown that most existing techniques handle structured control flow efficiently, some are incapable of executing unstructured control flow directly, and none handles unstructured control flow efficiently. A suggestion to reduce the impact of this problem is provided. An unstructured-to-structured control flow transformation is implemented and its performance impact on a large class of GPU applications is assessed. The results quantify the importance of improving support for programs with unstructured control flow on GPUs. The transformation can also be used as a JIT compiler pass to execute programs with unstructured control flow on the GPU devices that do not support unstructured control flow. This is an important capability for execution portability of applications using GPU accelerators.

[1]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[2]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[3]  Adam Levinthal,et al.  Chap - a SIMD graphics processor , 1984, SIGGRAPH.

[4]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1987, TOPL.

[5]  William J. Dally,et al.  A bandwidth-efficient architecture for media processing , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[6]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[7]  Erik H. D'Hollander,et al.  Using hammock graphs to structure programs , 2004, IEEE Transactions on Software Engineering.

[8]  Timothy J. Harvey,et al.  AS imple, Fast Dominance Algorithm , 1999 .

[9]  David R. Kaeli,et al.  Caracal: dynamic translation of runtime environments for GPUs , 2011, GPGPU-4.

[10]  Ahmed Sameh,et al.  The Illiac IV system , 1972 .

[11]  Tao Li,et al.  Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[12]  Kevin Skadron,et al.  Accelerating SQL database operations on a GPU with CUDA , 2010, GPGPU-3.

[13]  Randima Fernando,et al.  GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics , 2004 .

[14]  Ken Kennedy,et al.  AS imple, Fast Dominance Algorithm , 1999 .

[15]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[16]  Satoshi Matsuoka The Road to TSUBAME and Beyond , 2008 .

[17]  David K. McAllister,et al.  OptiX: a general purpose ray tracing engine , 2010, ACM Trans. Graph..

[18]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).