A Case for a Flexible Scalar Unit in SIMT Architecture

The wide availability and the Single-Instruction Multiple-Thread (SIMT)-style programming model have made graphics processing units (GPUs) a promising choice for high performance computing. However, because of the SIMT style processing, an instruction will be executed in every thread even if the operands are identical for all the threads. To overcome this inefficiency, the AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit. In GCN, both the SIMT unit and the scalar unit share a single SIMT style instruction stream. Depending on its type, an instruction is issued to either a scalar or a SIMT unit. In this paper, we propose to extend the scalar unit so that it can either share the instruction stream with the SIMT unit or execute a separate instruction stream. The program to be executed by the scalar unit is referred to as a scalar program and its purpose is to assist SIMT-unit execution. The scalar programs are either generated from SIMT programs automatically by the compiler or manually developed by expert developers. We make a case for our proposed flexible scalar unit through three collaborative execution paradigms: data prefetching, control divergence elimination, and scalar-workload extraction. Our experimental results show that significant performance gains can be achieved using our proposed approaches compared to the state-of-art SIMT style processing.

[1]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Christopher Hughes,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, ISCA 2001.

[3]  Mattan Erez,et al.  Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation , 2013, ISCA.

[4]  Kunle Olukotun,et al.  Accelerating CUDA graph algorithms at maximum warp , 2011, PPoPP '11.

[5]  James E. Smith,et al.  Vector instruction set support for conditional operations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[6]  Mahmut T. Kandemir,et al.  Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.

[7]  Brian Kingsbury,et al.  Spert-II: A Vector Microprocessor System , 1996, Computer.

[8]  Xipeng Shen,et al.  Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.

[9]  Naga K. Govindaraju,et al.  High performance discrete Fourier transforms on graphics processors , 2008, HiPC 2008.

[10]  Tor M. Aamodt,et al.  Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[11]  Christopher Batten,et al.  The Vector-Thread Architecture , 2004, ISCA 2004.

[12]  Matthew Mattina,et al.  Tarantula: a vector extension to the alpha architecture , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[13]  Yi Yang,et al.  CPU-assisted GPGPU on fused CPU-GPU architectures , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[14]  David A. Patterson,et al.  Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[15]  Richard W. Vuduc,et al.  Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[16]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[17]  Mateo Valero,et al.  Out-of-order vector architectures , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[18]  Rajeev Balasubramonian,et al.  Dynamically allocating processor resources between nearby and distant ILP , 2001, ISCA 2001.

[19]  Josep Torrellas,et al.  Using a user-level memory thread for correlation prefetching , 2002, ISCA.

[20]  Yi Yang,et al.  Warp-level divergence in GPUs: Characterization, impact, and mitigation , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[21]  Christopher Batten,et al.  Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators , 2013, ACM Trans. Comput. Syst..

[22]  Christoforos E. Kozyrakis,et al.  Overcoming the limitations of conventional vector processors , 2003, ISCA '03.

[23]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[24]  Mateo Valero,et al.  Decoupled vector architectures , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[25]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[26]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[27]  Donald Yeung,et al.  Design and evaluation of compiler algorithms for pre-execution , 2002, ASPLOS X.

[28]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[29]  Naga K. Govindaraju,et al.  High performance discrete Fourier transforms on graphics processors , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[30]  Hsien-Hsin S. Lee,et al.  COMPASS: a programmable data prefetcher using idle GPU shaders , 2010, ASPLOS XV.

[31]  Rudolf Eigenmann,et al.  Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation , 2003, LCPC.

[32]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[33]  Yao Zhang,et al.  Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations , 2009, Euro-Par Workshops.

[34]  Krste Asanovic,et al.  Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).