An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization

Graphics processing units (GPUs) are composed of a group of single-instruction, multiple-data (SIMD) streaming multiprocessors (SMs). GPUs efficiently execute highly data-parallel tasks through SIMD execution on the SMs. However, if the threads mapped to a SIMD unit take diverging control paths, all divergent paths are executed serially. In the worst case, every thread takes a different control path and the highly parallel hardware is effectively used by one thread at a time. This control flow divergence problem is well known in GPU development; code transformation, memory access redirection, and data layout reorganization are commonly used to reduce its impact. These techniques attempt to eliminate divergence by grouping together threads or data to ensure identical behavior. However, prior efforts using these techniques neither model the performance impact of a particular divergence nor account for the possibility that divergence cannot be eliminated completely. We therefore analyze the performance impact of divergence and design thread regrouping algorithms that either eliminate divergence or minimize the impact of the divergence that remains. Finally, we develop a divergence optimization framework that analyzes and transforms the kernel at compile time and regroups the threads at runtime. For compute-bound applications, our proposed metrics estimate performance within 6.2% of measured performance. Using these metrics, we develop thread regrouping algorithms that account for the impact of divergence and speed up these applications by $2.2\times$ on average on an NVIDIA GTX 480.
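The divergence and regrouping idea above can be made concrete with a small sketch. The CUDA example below is an illustration under assumed names, data, and a hypothetical 0.5f branch threshold; it is not the framework described in the paper. It shows a kernel whose branch depends on per-thread data: with alternating inputs, every warp executes both branch paths serially, while a simple host-side sort, standing in for thread-data remapping or regrouping, places threads that take the same path into the same warps.

```cuda
// Minimal sketch of control flow divergence and thread-data remapping.
// Kernel name, threshold, and the host-side sort are illustrative assumptions,
// not the paper's implementation.
#include <cstdio>
#include <algorithm>
#include <cuda_runtime.h>

// Within a warp, threads whose inputs fall on different sides of the threshold
// take different branches, so the SIMD hardware executes both paths serially.
__global__ void divergentKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] < 0.5f) {
        // "cheap" path
        out[i] = in[i] * 2.0f;
    } else {
        // "expensive" path
        float v = in[i];
        for (int k = 0; k < 100; ++k) v = v * 0.99f + 0.01f;
        out[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *h_in = new float[n];
    // Alternating values force both branch paths in every warp (full divergence).
    for (int i = 0; i < n; ++i) h_in[i] = (i % 2) ? 0.9f : 0.1f;

    // Thread-data remapping on the host: sorting groups values that take the
    // same branch into the same warps, so each warp follows a single path.
    float *h_remapped = new float[n];
    std::copy(h_in, h_in + n, h_remapped);
    std::sort(h_remapped, h_remapped + n);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    dim3 block(256), grid((n + block.x - 1) / block.x);

    // Divergent launch: alternating data, both paths executed per warp.
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    divergentKernel<<<grid, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    // Regrouped launch: sorted data, one path per warp, divergence removed.
    cudaMemcpy(d_in, h_remapped, n * sizeof(float), cudaMemcpyHostToDevice);
    divergentKernel<<<grid, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out);
    delete[] h_in; delete[] h_remapped;
    printf("done\n");
    return 0;
}
```

In practice the remapping itself has a cost, which is why a performance model matters: it lets a regrouping algorithm decide whether the predicted reduction in divergence outweighs the overhead of reordering threads or data.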
