Dynamic Warp Resizing in High-Performance SIMT

Modern GPUs synchronize the threads of a warp at every instruction. This improves SIMD efficiency and makes it possible to share fetch and decode resources across the warp. The number of threads in each warp (the warp size) affects divergence, synchronization overhead, and the efficiency of memory access coalescing. Small warps reduce the performance penalty associated with branch and memory divergence, at the expense of reduced memory coalescing. Large warps enhance memory coalescing significantly but also increase branch and memory divergence. Dynamic workload behavior, including branch/memory divergence and coalescing, largely determines which warp size delivers the best performance, and the optimal warp size can vary from one workload to another or from one program phase to the next. Based on this observation, we propose Dynamic Warp Resizing (DWR). DWR introduces microarchitectural mechanisms that adjust the warp size at runtime according to program characteristics. DWR outperforms static warp-size decisions, by up to 1.7X to 2.28X, while imposing less than 1% area overhead. We also investigate alternative configurations and show that DWR performs better with narrower SIMD widths and larger caches.
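
To illustrate the trade-off the abstract describes, the following is a minimal software sketch of a per-phase warp-size decision driven by observed divergence and coalescing behavior. It is a hypothetical illustration, not the paper's actual DWR hardware mechanism; the structure `PhaseStats`, the function `choose_warp_size`, and all thresholds are assumptions made for the example.

```cpp
// Hypothetical model of a runtime warp-size controller.
// High branch divergence favors small warps (fewer idle lanes per split branch);
// high memory coalescing favors large warps (wider merged memory transactions).
#include <cstdio>

struct PhaseStats {
    double branch_divergence;  // fraction of branches where a warp's threads take different paths
    double coalescing_rate;    // fraction of memory accesses merged into a single transaction
};

// Pick a warp size for the next program phase (thresholds are illustrative only).
int choose_warp_size(const PhaseStats& s) {
    if (s.branch_divergence > 0.30 && s.coalescing_rate < 0.50)
        return 8;    // divergence-dominated phase: shrink warps
    if (s.branch_divergence < 0.10 && s.coalescing_rate > 0.80)
        return 64;   // coalescing-friendly phase: grow warps
    return 32;       // mixed behavior: keep the conventional warp size
}

int main() {
    PhaseStats phases[] = {
        {0.45, 0.35},  // irregular, branch-heavy phase
        {0.05, 0.90},  // streaming, regular-access phase
        {0.20, 0.60},  // mixed phase
    };
    for (const PhaseStats& p : phases)
        std::printf("divergence=%.2f coalescing=%.2f -> warp size %d\n",
                    p.branch_divergence, p.coalescing_rate, choose_warp_size(p));
    return 0;
}
```

The sketch only captures the decision policy; the paper's contribution is the microarchitectural support that makes such runtime resizing feasible at low area cost.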
