Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory

The key to high performance on graphics processing units (GPUs) is massive multithreading, which lets GPUs hide memory access latency by exploiting maximum thread-level parallelism (TLP). However, increasing TLP and the number of cores does not by itself improve performance, because threads contend for shared memory resources such as the last-level cache. Given current trends in VLSI technology and GPU architectures toward ever more processing cores, future GPUs will need larger last-level caches (the L2 cache in a GPU), and larger L2 caches inevitably consume more power. In this chapter, having investigated the behavior of general-purpose GPU (GPGPU) applications, we present an efficient L2 cache architecture for GPUs based on spin-transfer torque RAM (STT-RAM) technology. Because of its high density and low power characteristics, STT-RAM is well suited to GPUs, where numerous cores leave limited area for on-chip memory banks. STT-RAM, however, has two important drawbacks that must be addressed: the high energy and long latency of write operations. Relaxing the retention time of STT-RAM cells reduces write energy and delay, but employing low-retention STT-RAM in GPUs requires a thorough study of the behavior of GPGPU applications. The STT-RAM L2 cache architecture proposed in this chapter improves IPC by up to 171% (20% on average) and reduces average power consumption by 28.9% compared to a conventional L2 cache architecture occupying the same on-chip area.
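To make the retention-time trade-off concrete, the sketch below models one simple way a low-retention STT-RAM L2 line could be managed: every line remembers when its data was last written, and a periodic scrub pass writes back dirty lines or invalidates clean ones before their age reaches the retention limit. This is a minimal conceptual sketch, not the policy proposed in the chapter; the retention window, scrub period, write ratio, and the decision to write back rather than refresh in place are all illustrative assumptions.

```python
import random

RETENTION_MS = 10.0    # assumed retention time of a low-retention STT-RAM cell
SCRUB_PERIOD_MS = 2.0  # assumed interval between scrub passes


class Line:
    def __init__(self, tag, written_at, dirty):
        self.tag = tag
        self.written_at = written_at  # last write restarts the retention window
        self.dirty = dirty


class RetentionAwareL2:
    """Toy model of a low-retention STT-RAM L2 bank (illustrative only).

    A periodic scrub handles lines whose age approaches the retention limit:
    dirty lines are written back to DRAM, clean lines are simply invalidated
    and re-fetched on the next miss.
    """

    def __init__(self):
        self.lines = {}          # tag -> Line
        self.writebacks = 0
        self.invalidations = 0

    def access(self, tag, now_ms, is_write):
        line = self.lines.get(tag)
        if line is None:
            # Miss: the fill itself rewrites the cell, starting a fresh window.
            self.lines[tag] = Line(tag, now_ms, dirty=is_write)
            return "miss"
        if is_write:
            line.written_at = now_ms
            line.dirty = True
        return "hit"

    def scrub(self, now_ms):
        # Act one scrub period early so data never silently decays.
        deadline = RETENTION_MS - SCRUB_PERIOD_MS
        for tag in list(self.lines):
            line = self.lines[tag]
            if now_ms - line.written_at >= deadline:
                if line.dirty:
                    self.writebacks += 1      # stand-in for a DRAM write-back
                else:
                    self.invalidations += 1   # clean data can just expire
                del self.lines[tag]


if __name__ == "__main__":
    random.seed(0)
    cache = RetentionAwareL2()
    t, next_scrub = 0.0, SCRUB_PERIOD_MS
    for _ in range(10_000):
        t += random.expovariate(5.0)  # synthetic stream, ~5 accesses per ms
        cache.access(random.randrange(256), t, is_write=random.random() < 0.3)
        while t >= next_scrub:
            cache.scrub(next_scrub)
            next_scrub += SCRUB_PERIOD_MS
    print(f"write-backs: {cache.writebacks}, invalidations: {cache.invalidations}")
```

Counting write-backs versus invalidations for a given access stream gives a rough sense of why the chosen retention time must match the write-interval behavior of GPGPU applications, which is exactly the kind of workload characterization the chapter builds on.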
