GPU Register Packing: Dynamically Exploiting Narrow-Width Operands to Improve Performance

Graphics processing units (GPUs) are increasingly used to accelerate general-purpose computations. By exploiting massive thread-level parallelism (TLP), GPUs achieve high throughput and hide memory latency. This requires a very large register file (RF) to enable fast, low-cost context switching between tens of thousands of active threads. However, RF capacity is still insufficient to support all available thread-level parallelism, and the shortage can hurt performance by limiting the occupancy of GPU threads. Moreover, if the available RF capacity cannot satisfy the requirements of a thread block, the GPU must fetch some variables from local memory, which can incur long memory access latencies. Observing that, in many GPGPU applications, a large percentage of computed results have fewer significant bits than the full width of a 32-bit register, we propose a GPU register packing scheme that dynamically exploits narrow-width operands and packs multiple operands into a single full-width register. Dynamic register packing frees RF space, which allows the GPU to enable more TLP by assigning additional thread blocks to streaming multiprocessors (SMs) and thus improves performance. Experimental results show that our GPU register packing scheme achieves a speedup of up to 1.96X, and 1.18X on average.
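The packing idea can be illustrated with a small software model. The sketch below is not the hardware mechanism proposed in the paper. It assumes a fixed 16-bit narrow-width threshold and two operands per 32-bit register, and every name in it (`significant_bits`, `is_narrow`, `pack`, `unpack`, `blocks_per_sm`) is hypothetical. The occupancy helper considers only the register file limit and ignores other per-SM limits such as shared memory or the maximum number of resident blocks.

```python
MASK32 = 0xFFFFFFFF
NARROW_BITS = 16  # assumed threshold: a value is "narrow" if it fits in 16 bits


def significant_bits(value: int) -> int:
    """Bits needed to hold a 32-bit two's-complement value, sign bit included."""
    v = value & MASK32
    if v & 0x80000000:
        # Negative value: flip the bits so leading ones become leading zeros.
        v = ~v & MASK32
    return v.bit_length() + 1


def is_narrow(value: int) -> bool:
    """True if the value can be stored in half of a 32-bit register."""
    return significant_bits(value) <= NARROW_BITS


def pack(lo: int, hi: int) -> int:
    """Pack two narrow operands into one 32-bit register image."""
    if not (is_narrow(lo) and is_narrow(hi)):
        raise ValueError("both operands must be narrow to share a register")
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def _sign_extend16(x: int) -> int:
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def unpack(reg: int) -> tuple[int, int]:
    """Recover the two sign-extended operands from a packed register image."""
    return _sign_extend16(reg), _sign_extend16(reg >> 16)


def blocks_per_sm(rf_registers: int, regs_per_thread: int,
                  threads_per_block: int) -> int:
    """Thread blocks that fit on one SM when only the RF is the limit."""
    return rf_registers // (regs_per_thread * threads_per_block)


if __name__ == "__main__":
    reg = pack(-3, 1000)
    print(hex(reg), unpack(reg))
    # Illustrative numbers only: if packing lowered the effective register
    # demand from 24 to 18 per thread, more blocks would fit on the SM.
    print(blocks_per_sm(32768, 24, 256), blocks_per_sm(32768, 18, 256))
```

In this model, packing lets two operands share one register, so fewer physical registers are needed per thread and more thread blocks fit in the same register file. The per-thread register counts in the usage lines are made-up inputs and are not results from the paper.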
