ContextPreRF: Enhancing the Performance and Energy of GPUs With Nonuniform Register Access

Register files are a key data storage unit that impacts instruction throughput in graphics processing units (GPUs). Typically, GPU register files are quite large to accommodate many concurrent threads and are implemented using the same SRAM technology as the on-chip cache. We propose ContextRF, a new register file architecture that efficiently leverages memories with nonuniform access characteristics, including hybrid SRAM/DRAM (S/D) and spintronic domain-wall memories (DWMs). ContextRF allows greater-capacity register files to be implemented in the same area within the GPU, with reduced power consumption. We also propose ContextPreRF, a hardware preswitch scheme that hides switching delays: as soon as a register request is queued, the nonuniform access memories containing the corresponding register are sent a preemptive switch request. Thus, our scheme transparently hides the penalties of switching between register contexts. Replacing the register file SRAM with S/D reduces energy by 37%, with a 1.4% average performance drop. Employing DWM, we reduce register file energy by 74%, with a 0.4% average performance penalty. For the denser DWM, we also model converting the saved area into additional registers, cache, and shared memory; this improves performance by 13.5% over the baseline SRAM register file.
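
The preswitch idea can be illustrated with a toy timing model. The sketch below is not the paper's simulator or hardware design. It assumes a single register bank that holds one active context at a time, a fixed delay between queueing a request and its register access, and a fixed context switch cost. The function name `simulate`, its parameters, and the cycle counts are all hypothetical and chosen only to show how a switch issued at queue time can overlap with the queueing delay.

```python
def simulate(requests, queue_delay, switch_delay, preswitch, active=0):
    """Return the cycle at which the last register access completes.

    requests: list of (arrival_cycle, context) tuples, in arrival order.
    queue_delay: cycles between a request being queued and being ready
        to access the bank (hypothetical pipeline depth).
    switch_delay: cycles needed to switch the bank to another context.
    preswitch: if True, the switch starts as soon as the request is
        queued and the bank is idle; otherwise it starts only when the
        request is ready to access the bank.
    active: context the bank holds at cycle 0 (assumed to be 0).
    """
    bank_free = 0  # first cycle at which the bank is idle
    finish = 0
    for arrival, ctx in requests:
        ready = arrival + queue_delay
        if ctx == active:
            # No switch needed: wait only for the queue and the bank.
            start = max(ready, bank_free)
        elif preswitch:
            # Switch overlaps with the time the request spends queued.
            switch_done = max(arrival, bank_free) + switch_delay
            start = max(ready, switch_done)
        else:
            # Switch is serialized in front of the access.
            start = max(ready, bank_free) + switch_delay
        finish = start + 1  # one cycle for the access itself
        bank_free = finish
        active = ctx
    return finish


# Requests alternate between two contexts, so every access after the
# first needs a switch.
reqs = [(0, 0), (4, 1), (8, 0), (12, 1)]
baseline = simulate(reqs, queue_delay=3, switch_delay=2, preswitch=False)
hidden = simulate(reqs, queue_delay=3, switch_delay=2, preswitch=True)
print(baseline, hidden)  # 18 16
```

In this toy setting the switch delay is shorter than the queueing delay, so the preswitched run finishes at the same cycle as a run with no context switches at all. When the switch delay exceeds the queueing delay, the model hides only part of the penalty.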
