Paper Information - A Novel High Performance and Energy Efficient NUCA Architecture for STT-MRAM LLCs With Thermal Consideration

As the speed gap between modern processors and off-chip main memory widens, on-chip cache capacity grows to sustain performance scaling. As a result, cache power occupies a large portion of the total power budget. Spin-transfer torque magnetic random access memory (STT-MRAM) has been proposed as a promising solution for low-power cache design because of its high integration density and ultralow leakage power. Nevertheless, the high write power and write latency of STT-MRAM are new barriers to the commercialization of this emerging technology. In this paper, we investigate the thermal effect on the access performance of STT-MRAM and observe that temperature significantly affects write delay and write energy. We then explore the nonuniform cache access (NUCA) design of chip multiprocessors with an STT-MRAM-based last level cache (LLC). We propose a thermal-aware data migration policy, called "Thermosiphon," which takes advantage of the thermal property of STT-MRAM to reduce LLC write energy. This policy dynamically splits the LLC into regions based on the thermal distribution monitored by the thermal sensors available on-chip, and adaptively migrates write-intensive data among thermal regions according to the thermal gradient. Compared with the conventional NUCA design, the proposed design saves up to 41.2% of write energy, and 13.01% on average, with negligible hardware overhead.
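The region-splitting and migration steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Bank`, `split_regions`, and `pick_target`, the write-count threshold, and the minimum temperature gradient are all hypothetical. The sketch also assumes that writes are cheaper in hotter banks, a direction the abstract does not state.

```python
from dataclasses import dataclass


@dataclass
class Bank:
    bank_id: int
    temperature_c: float  # reading from a hypothetical on-chip thermal sensor


def split_regions(banks, num_regions=2):
    """Partition banks into thermal regions, coolest region first."""
    ordered = sorted(banks, key=lambda b: b.temperature_c)
    size = -(-len(ordered) // num_regions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def pick_target(regions, current, write_count,
                write_threshold=8, min_gradient_c=5.0):
    """Return the bank a block should live in after this decision.

    ASSUMPTION: hotter banks cost less energy per write, so write-intensive
    blocks move toward the hottest region. Both thresholds are illustrative.
    """
    if write_count < write_threshold:
        return current  # block is not write intensive: leave it in place
    hottest = max(regions[-1], key=lambda b: b.temperature_c)
    if hottest.temperature_c - current.temperature_c < min_gradient_c:
        return current  # thermal gradient too small to justify a migration
    return hottest


banks = [Bank(0, 50.0), Bank(1, 70.0), Bank(2, 55.0), Bank(3, 80.0)]
regions = split_regions(banks)
target = pick_target(regions, banks[0], write_count=10)
print(target.bank_id)
```

Recomputing `split_regions` whenever the sensor readings change is what makes the regions dynamic in this sketch. The gradient check keeps a block in place when a migration would gain little.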
