A Novel High Performance and Energy Efficient NUCA Architecture for STT-MRAM LLCs With Thermal Consideration

As the speed gap between modern processors and off-chip main memory widens, on-chip cache capacity keeps growing to sustain performance scaling. As a result, cache power occupies a large portion of the total power budget. Spin-transfer torque magnetic random access memory (STT-MRAM) has been proposed as a promising solution for low-power cache design because of its high integration density and ultralow leakage power. Nevertheless, the high write power and write latency of STT-MRAM are new barriers to the commercialization of this emerging technology. In this paper, we investigate the thermal effect on the access performance of STT-MRAM and observe that temperature can significantly affect write delay and write energy. We then explore the nonuniform cache access (NUCA) design of chip multiprocessors with an STT-MRAM-based last-level cache (LLC). We propose a thermal-aware data migration policy, called "Thermosiphon," which takes advantage of the thermal property of STT-MRAM to reduce LLC write energy. This policy dynamically splits the LLC into regions based on the thermal distribution monitored by on-chip thermal sensors, and adaptively migrates write-intensive data among the thermal regions according to the thermal gradient. Compared to a conventional NUCA design, the proposed design saves up to 41.2% of write energy, and 13.01% on average, with negligible hardware overhead.
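The region-splitting and migration steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the algorithm, so the `Bank` type, the rank-based region split, the `write_threshold` and `min_gradient_c` parameters, and the choice to move write-intensive blocks toward the warmest region (on the assumption that STT-MRAM writes cost less energy at higher temperature) are all hypothetical. Bank capacity and migration cost are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Bank:
    """One LLC bank (hypothetical model, not from the paper)."""
    bank_id: int
    temperature_c: float  # reading from an assumed on-chip thermal sensor
    blocks: dict = field(default_factory=dict)  # block address -> write count


def split_into_regions(banks, num_regions=3):
    """Rank banks by temperature and cut the ranking into regions.

    Region 0 is the coolest and the last region is the hottest.
    """
    if not banks:
        return []
    ordered = sorted(banks, key=lambda b: b.temperature_c)
    size = -(-len(ordered) // num_regions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def migrate_write_intensive(banks, write_threshold=8,
                            min_gradient_c=5.0, num_regions=3):
    """Move write-intensive blocks from cooler regions to the hottest bank.

    Assumption: writes are cheaper in warmer banks, so a block migrates only
    if the temperature gap is at least min_gradient_c degrees.
    Returns a list of (address, source bank id, destination bank id).
    """
    regions = split_into_regions(banks, num_regions)
    if len(regions) < 2:
        return []  # no thermal gradient to exploit
    dst = max(regions[-1], key=lambda b: b.temperature_c)
    moves = []
    for region in regions[:-1]:
        for src in region:
            if dst.temperature_c - src.temperature_c < min_gradient_c:
                continue
            for addr, writes in list(src.blocks.items()):
                if writes >= write_threshold:
                    dst.blocks[addr] = src.blocks.pop(addr)
                    moves.append((addr, src.bank_id, dst.bank_id))
    return moves
```

Re-running `migrate_write_intensive` whenever the sensor readings are refreshed would re-derive the regions each time, which is one simple way to model the dynamic splitting the abstract describes.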
