A Survey of Techniques for Architecting and Managing GPU Register File

To support their massively-multithreaded architecture, GPUs use a very large register file (RF) whose capacity exceeds even that of their L1 and L2 caches. In sharp contrast, traditional CPUs use a tiny RF and much larger caches to optimize latency. Because of these differences, and because the RF has a crucial impact on GPU performance, novel and intelligent techniques are required for managing the GPU RF. In this paper, we survey techniques for designing and managing the GPU RF. We discuss techniques related to the performance, energy, and reliability aspects of the RF. To emphasize the similarities and differences between the techniques, we classify them along several parameters. The aim of this paper is to synthesize state-of-the-art developments in RF management and to stimulate further research in this area.
