Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency

Temporal SIMT (TSIMT) has been proposed as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been sketched in prior work, it has not been evaluated in detail. We present a complete design and evaluation of TSIMT GPUs, extend the design with scalarization, and combine temporal and spatial execution into Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone degrades performance, but combining scalarization with STSIMT yields a mean performance improvement of 19.6% and a 26.2% better energy-delay product compared to conventional SIMT.
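To make the tradeoff concrete, the sketch below models lane utilization for a single divergent warp instruction under the two execution styles. It is an illustrative toy model, not the paper's simulator; the 32-thread warp, the active mask, and the lane-cycle accounting are assumptions chosen for the example.

```python
# Illustrative utilization model (a sketch, not the paper's simulator).
# It contrasts spatial SIMT, which issues a warp instruction across all
# lanes in one cycle, with temporal SIMT (TSIMT), which streams the
# threads of a warp through one lane over consecutive cycles and can
# skip threads that a branch has made inactive. Warp size of 32 is assumed.

WARP_SIZE = 32

def spatial_simt(active_mask):
    """All lanes are occupied for one cycle, whether active or not."""
    spent = WARP_SIZE          # lane-cycles consumed
    useful = sum(active_mask)  # lane-cycles doing real work
    return spent, useful

def temporal_simt(active_mask):
    """Inactive threads are skipped, so only active ones consume cycles."""
    spent = sum(active_mask)
    useful = sum(active_mask)
    return spent, useful

def report(name, spent, useful):
    util = useful / spent if spent else 1.0
    print(f"{name:14s} spent={spent:2d} useful={useful:2d} utilization={util:.0%}")

# Example: a divergent branch leaves 4 of 32 threads active.
mask = [True] * 4 + [False] * (WARP_SIZE - 4)
report("spatial SIMT", *spatial_simt(mask))    # 32 lane-cycles, 12% useful
report("temporal SIMT", *temporal_simt(mask))  # 4 lane-cycles, 100% useful

# Scalarization adds a further saving: an operation whose operands are
# identical across the warp (e.g., computing an address or loop bound from
# kernel parameters) needs one scalar execution instead of WARP_SIZE copies.
```

This toy model only captures issue-slot waste; it ignores the serialization latency that makes TSIMT alone slower, which is why the reported gains come from combining temporal and spatial execution (STSIMT) with scalarization.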
