Reducing branch divergence in GPU programs

Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch direction and delaying those that take the other direction until later iterations. Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths. We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly-optimized real-world application. Our evaluation shows that they improve the performance of the synthetic benchmarks by as much as 30% and 80% respectively, and that of the real-world application by 12% and 16% respectively.

[1]  Tor M. Aamodt,et al.  Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware , 2009, TACO.

[2]  Michael Wolfe,et al.  Implementing the PGI Accelerator model , 2010, GPGPU-3.

[3]  Lothar Lilge,et al.  GPU-accelerated Monte Carlo simulation for photodynamic therapy treatment planning , 2009, European Conference on Biomedical Optics.

[4]  Xipeng Shen,et al.  Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.

[5]  Yi Yang,et al.  A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.

[6]  David R. Kaeli,et al.  Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures , 2011, IEEE Transactions on Parallel and Distributed Systems.

[7]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[8]  Benoît Meister,et al.  A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction , 2010, GPGPU-3.

[9]  Bjorn De Sutter,et al.  Compiler techniques for code compaction , 2000, TOPL.

[10]  Xiaoming Li,et al.  A control-structure splitting optimization for GPGPU , 2009, CF '09.

[11]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[12]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[13]  J. Ramanujam,et al.  Automatic C-to-CUDA Code Generation for Affine Programs , 2010, CC.

[14]  Rudolf Eigenmann,et al.  OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[15]  Wen-mei W. Hwu,et al.  CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.

[16]  Wen-mei W. Hwu,et al.  Program optimization carving for GPU computing , 2008, J. Parallel Distributed Comput..