Towards green GPUs: Warp size impact analysis

Selecting the right GPU configuration can impact the overall design in many ways. One of the critical parameters in a GPU is warp size. Smaller warps come with branch divergence reduction while larger warps provide better memory coalescing. In this work we are interested in two possible design choices and their impacts on GPUs: using small warps and investing in finding new solutions to enhance coalescing or using large warps and addressing branch divergence by employing effective control-flow solutions. We also analyze warp size impact on memory coalescing and branch divergence and hence energy. We use our findings to study two machines: a GPU using small warps but equipped with excellent memory coalescing (SW+) and a GPU using large warps but employing an MIMD engine immune from control-flow costs (LW+). We conclude that SW+ provides better energy efficiency when compared to LW+.

[1]  Sanjay J. Patel,et al.  Tradeoffs in designing accelerator architectures for visual computing , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[2]  Ahmad Khonsari,et al.  Dynamic warp resizing: Analysis and benefits in high-performance SIMT , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[3]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[5]  Ahmad Khonsari,et al.  Warp size impact in GPUs: large or small? , 2013, GPGPU@ASPLOS.

[6]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[7]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[8]  Nicolas Brunie,et al.  Simultaneous branch and warp interweaving for sustained GPU performance , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[9]  Tor M. Aamodt,et al.  Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[10]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[11]  Andrew B. Kahng,et al.  ORION 2.0: A Power-Area Simulator for Interconnection Networks , 2012, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[12]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[13]  Amirali Baniasadi,et al.  Performance in GPU Architectures: Potentials and Distances , 2011 .

[14]  Matei Ripeanu,et al.  Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[15]  Onur Mutlu,et al.  Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[16]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.