Improving Energy Efficiency of GPUs through Data Compression and Compressed Execution

GPU design trends show that the register file size will continue to increase to enable even more thread level parallelism. As a result register file consumes a large fraction of the total GPU chip power. This paper explores register file data compression for GPUs to improve power efficiency. Compression reduces the width of the register file read and write operations, which in turn reduces dynamic power. This work is motivated by the observation that the register values of threads within the same warp are similar, namely the arithmetic differences between two successive thread registers is small. Compression exploits the value similarity by removing data redundancy of register values. Without decompressing operand values some instructions can be processed inside register file, which enables to further save energy by minimizing data movement and processing in power hungry main execution unit. Evaluation results show that the proposed techniques save 25 percent of the total register file energy consumption and 21 percent of the total execution unit energy consumption with negligible performance impact.

[1]  Mark R. Nelson,et al.  LZW data compression , 1989 .

[2]  Peter Deutsch,et al.  DEFLATE Compressed Data Format Specification version 1.3 , 1996, RFC.

[3]  Trevor N. Mudge,et al.  Improving code density using compression techniques , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[4]  James E. Smith,et al.  The predictability of data values , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[5]  Jun Yang,et al.  Frequent Value Locality and Value-Centric Data Cache Design , 2000, ASPLOS.

[6]  Hubertus Franke,et al.  Memory Expansion Technology (MXT): Software support and performance , 2001, IBM J. Res. Dev..

[7]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture , 2003, IEEE Micro.

[8]  David A. Wood,et al.  Adaptive cache compression for high-performance processors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[9]  Mateo Valero,et al.  A content aware integer register file organization , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[10]  David A. Wood,et al.  Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches , 2004 .

[11]  Hiroshi Nakamura,et al.  A small, fast and low-power register file by bit-partitioning , 2005, 11th International Symposium on High-Performance Computer Architecture.

[12]  Aviral Shrivastava,et al.  Bypass aware instruction scheduling for register file power reduction , 2006 .

[13]  Chita R. Das,et al.  Performance and power optimization through data compression in Network-on-Chip architectures , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[14]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[15]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[16]  Tom R. Halfhill NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .

[17]  Yao Zhang,et al.  Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations , 2009, Euro-Par Workshops.

[18]  Xi Chen,et al.  C-Pack: A High-Performance Microprocessor Cache Compression Algorithm , 2010, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[19]  Yunsi Fei,et al.  Register file partitioning and recompilation for register file power reduction , 2010, TODE.

[20]  Bevan M. Baas,et al.  Design of an energy-efficient 32-bit adder operating at subthreshold voltages in 45-nm CMOS , 2010, International Conference on Communications and Electronics 2010.

[21]  William J. Dally,et al.  A compile-time managed multi-level register file hierarchy , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[23]  Mark Horowitz,et al.  Energy-Efficient Floating-Point Unit Design , 2011, IEEE Transactions on Computers.

[24]  Chia-Lin Yang,et al.  Power gating strategies on GPUs , 2011, TACO.

[25]  William J. Dally,et al.  Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[26]  Sylvain Collange,et al.  Affine Vector Cache for memory bandwidth savings , 2011 .

[27]  Nam Sung Kim,et al.  Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[28]  Onur Mutlu,et al.  Base-delta-immediate compression: Practical data compression for on-chip caches , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[29]  Hyeran Jeon,et al.  Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[30]  Krste Asanovic,et al.  Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[31]  Zhongliang Chen,et al.  Characterizing scalar opportunities in GPGPU applications , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[32]  Ben H. H. Juurlink,et al.  How a single chip causes massive power bills GPUSimPow: A GPGPU power simulator , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[33]  Mohammad Abdel-Majeed,et al.  Warped register file: A power efficient register file for GPGPUs , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[34]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[35]  Christopher Torng,et al.  Microarchitectural mechanisms to exploit value structure in SIMT architectures , 2013, ISCA.

[36]  Rong Ge,et al.  Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU , 2013, 2013 42nd International Conference on Parallel Processing.

[37]  Mohammad Abdel-Majeed,et al.  Warped gates: Gating aware scheduling and power gating for GPGPUs , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[38]  Hiroshi Nakamura,et al.  Power capping of CPU-GPU heterogeneous systems through coordinating DVFS and task mapping , 2013, 2013 IEEE 31st International Conference on Computer Design (ICCD).

[39]  Nam Sung Kim,et al.  Power-efficient computing for compute-intensive GPGPU applications , 2013, HPCA.

[40]  Qunfeng Dong,et al.  A Case for a Flexible Scalar Unit in SIMT Architecture , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[41]  Waleed Dweik,et al.  Warped-Shield: Tolerating Hard Faults in GPGPUs , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[42]  Samira Manabi Khan,et al.  Last-level cache deduplication , 2014, ICS '14.

[43]  Somayeh Sardashti,et al.  Decoupled Compressed Cache: Exploiting Spatial Locality for Energy Optimization , 2014, IEEE Micro.

[44]  Murali Annavaram,et al.  PATS: Pattern aware scheduling and power gating for GPGPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[45]  Rajeev Balasubramonian,et al.  MemZip: Exploring unconventional benefits from memory compression , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[46]  Per Stenström,et al.  SC2: A statistical compression cache scheme , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[47]  Won Woo Ro,et al.  Warped-Compression: Enabling power efficient GPUs through register compression , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[48]  Luigi Carro,et al.  Understanding GPU errors on large-scale HPC systems and the implications for system design and operation , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[49]  Joel Emer,et al.  SASSIFI : Evaluating Resilience of GPU Applications , 2015 .