Drowsy Register Files for Reducing GPU Leakage Energy

General-purpose graphics processing units (GPGPUs) typically employ a large register file (RF) to support massive multithreading; the RF, however, accounts for a large fraction of a GPGPU's total power. In this paper, we propose three RF drowsy policies and evaluate their effectiveness at reducing leakage energy. Under the first policy, immediate sleep (Drowsy-IS), registers stay in the drowsy mode unless they are accessed, and they are returned to the drowsy mode immediately after each access to minimize leakage energy consumption. The second policy, temporary awake (Drowsy-TA), holds a register in the normal (i.e., active) mode for a certain period after an access to wait for the next access; the register is placed into the drowsy mode only once that period expires without any access activity. Finally, we propose an adaptive policy, Drowsy-RI, which identifies the re-access interval of each register at run time and lets the register wait for the predicted interval before it is put into the drowsy mode. Experimental results show that, compared to the baseline RF, Drowsy-IS reduces RF leakage energy by 91.7% on average at the cost of 4.4% performance degradation. Drowsy-TA incurs negligible performance overhead while reducing leakage energy by 82.8%. By balancing energy savings against performance overhead, Drowsy-RI saves more RF leakage energy (87.3%) than Drowsy-TA and incurs less performance degradation (2.7%) than Drowsy-IS.
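The trade-off among the three policies can be illustrated with a toy model of a single register. The sketch below is not the paper's simulator. The function name `simulate`, the `window` parameter, and the Drowsy-RI predictor (which simply reuses the last observed re-access interval) are assumptions made for illustration. It counts how many cycles the register spends awake (a rough proxy for leakage) and how many times it must be woken from the drowsy mode (a rough proxy for performance overhead).

```python
def simulate(accesses, total_cycles, policy, window=4):
    """Toy model of one register under a drowsy policy.

    accesses: sorted list of distinct cycle numbers (each < total_cycles)
              at which the register is read or written.
    policy:   "IS" (immediate sleep), "TA" (temporary awake),
              or "RI" (re-access interval; hypothetical last-interval predictor).
    Returns (awake_cycles, wakeups).
    """
    awake_cycles = 0
    wakeups = 0
    last = None      # cycle of the previous access
    deadline = -1    # last cycle (inclusive) the register stays awake

    for t in accesses:
        if last is not None:
            # Awake time since the previous access, cut short by this access.
            awake_cycles += min(deadline, t - 1) - last + 1
        if t > deadline:
            wakeups += 1  # register was drowsy and must be woken first

        if policy == "IS":
            hold = 0                      # sleep right after the access
        elif policy == "TA":
            hold = window                 # stay awake for a fixed window
        elif policy == "RI":
            # Assumed predictor: next interval equals the last one observed.
            hold = (t - last) if last is not None else 0
        else:
            raise ValueError(f"unknown policy: {policy}")

        deadline = min(t + hold, total_cycles - 1)
        last = t

    if last is not None:
        awake_cycles += deadline - last + 1  # tail after the final access
    return awake_cycles, wakeups


if __name__ == "__main__":
    trace = [0, 2, 4, 20, 22]  # made-up access trace, in cycles
    for policy in ("IS", "TA", "RI"):
        print(policy, simulate(trace, total_cycles=30, policy=policy))
```

On this made-up trace, the immediate-sleep model spends the fewest cycles awake but needs a wake-up on every access, the temporary-awake model needs the fewest wake-ups but stays awake longest, and the interval-based model falls between the two on both counts. The numbers say nothing about the percentages reported above, which come from the authors' own evaluation.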
