Balancing Scalar and Vector Execution on GPU Architectures

Graphics Processing Units (GPUs) have evolved into high-performance processors for general-purpose data-parallel applications. Most GPU execution exploits a Single Instruction Multiple Data (SIMD) model. Typically, little attention is paid to whether the input data to the SIMD lanes are the same or different. We have observed that a significant number of SIMD instructions demonstrate scalar characteristics, i.e., they operate on the same data across their active lanes. Treating them as normal SIMD instructions results in redundant and inefficient GPU execution. To better serve both scalar and vector operations, we propose a novel scalar-vector GPU architecture. Our specialized scalar pipeline handles scalar instructions efficiently with only a single copy of the data, freeing the SIMD pipeline for normal vector execution. We propose a novel synchronization scheme to resolve data dependencies between scalar and vector instructions. With our optimized warp scheduling and instruction dispatching schemes, the scalar-vector GPU architecture achieves an average performance improvement of 19% on the Parboil and Rodinia benchmark suites. We also examine the effects of varying warp sizes on scalar-vector execution and explore subwarp execution for power efficiency. Our results show that, on average, power is reduced by 18%.
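
To make the notion of a scalar instruction concrete, the following is a minimal sketch of the property described above: an instruction is treated as scalar when every active lane reads the same value for each source operand. The function name, the data layout, and the choice to treat an instruction with no active lanes as non-scalar are illustrative assumptions; this is not the detection mechanism used in the proposed architecture.

```python
from typing import Sequence


def is_scalar_instruction(
    operands: Sequence[Sequence[int]],
    active_mask: Sequence[bool],
) -> bool:
    """Return True if all active lanes see identical source operand values.

    operands[i][lane] is the value of source operand i in the given lane
    (hypothetical layout chosen for this sketch).
    active_mask[lane] is True when the lane is active for this instruction.
    """
    active_lanes = [lane for lane, on in enumerate(active_mask) if on]
    if not active_lanes:
        # Assumption: with no active lanes there is nothing to scalarize.
        return False

    first = active_lanes[0]
    # Every operand must match the first active lane in all active lanes.
    return all(
        operand[lane] == operand[first]
        for operand in operands
        for lane in active_lanes
    )


if __name__ == "__main__":
    # Lane 2 holds a different value but is inactive, so it is ignored.
    print(is_scalar_instruction([[7, 7, 9, 7]], [True, True, False, True]))
    # Each lane holds a distinct value (e.g. a thread index): a vector case.
    print(is_scalar_instruction([[0, 1, 2, 3]], [True, True, True, True]))
```

Only active lanes are compared, which matches the abstract's wording ("the same data across their active lanes"); inactive lanes may hold arbitrary values without affecting the result.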
