Warp size impact in GPUs: large or small?

Many design decisions affect a GPU's performance. Among them, the choice of warp size can deeply influence the rest of the design. Small warps reduce the performance penalty of branch divergence at the expense of reduced memory coalescing. Large warps enhance memory coalescing significantly but also increase branch divergence. This leaves designers with two choices: use small warps and invest in new solutions that enhance coalescing, or use large warps and address branch divergence with effective control-flow solutions. In this work, our goal is to determine which of the two is the better choice. We analyze the impact of warp size on memory coalescing and branch divergence. We use our findings to study two machines: a GPU that uses small warps but is equipped with excellent memory coalescing (SW+), and a GPU that uses large warps but employs an MIMD engine immune to control-flow costs (LW+). Our evaluations show that building coalescing-enhanced small-warp GPUs is a better approach than pursuing a control-flow-enhanced large-warp GPU.
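
The trade-off between coalescing and divergence can be illustrated with a toy model. The sketch below is not the simulator or methodology used in this work; it is a minimal illustration under stated assumptions: each warp issues one memory transaction per distinct 128-byte segment it touches (the segment size and the absence of cross-warp coalescing are assumptions), and a warp whose threads disagree on a single if/else executes both paths serially. The function names and the example access pattern are hypothetical.

```python
from typing import List


def memory_transactions(addresses: List[int], warp_size: int,
                        segment_bytes: int = 128) -> int:
    """Count transactions, assuming one per distinct segment touched by each warp.

    Assumption: requests are merged only within a warp, never across warps.
    """
    total = 0
    for start in range(0, len(addresses), warp_size):
        warp = addresses[start:start + warp_size]
        total += len({addr // segment_bytes for addr in warp})
    return total


def simd_efficiency(taken: List[bool], warp_size: int) -> float:
    """Fraction of lane slots doing useful work for a single if/else.

    Assumption: a warp with mixed branch outcomes runs both paths one after
    the other, so it occupies its lanes for two passes instead of one.
    """
    lane_slots = 0
    for start in range(0, len(taken), warp_size):
        warp = taken[start:start + warp_size]
        diverged = any(warp) and not all(warp)
        lane_slots += (2 if diverged else 1) * warp_size
    return len(taken) / lane_slots


# Hypothetical workload: 64 threads read consecutive 4-byte words, and the
# branch outcome flips every 8 threads.
addresses = [4 * i for i in range(64)]
taken = [(i // 8) % 2 == 0 for i in range(64)]

for warp_size in (8, 32):
    print(warp_size,
          memory_transactions(addresses, warp_size),
          simd_efficiency(taken, warp_size))
```

For this particular pattern, 8-wide warps never diverge (efficiency 1.0) but issue 8 transactions, while 32-wide warps issue only 2 transactions but diverge in every warp (efficiency 0.5). Real workloads and hardware coalescing rules are more varied, so the numbers only show the direction of the trade-off.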
