Coordinating prefetching and STT-RAM based last-level cache management for multicore systems

Data prefetching is a common mechanism for mitigating the off-chip memory bandwidth bottleneck in modern computing systems. Unfortunately, prefetching has side effects: it places an additional burden on off-chip communication and increases the number of cache write operations. Spin-transfer torque random access memory (STT-RAM) has been proposed for last-level caches (LLCs) because of its high density and low power consumption, but its write accesses are characteristically longer than those of traditional SRAM caches. The added write pressure from prefetching, coupled with this long write latency, exacerbates the performance cost of prefetching schemes. In this work, we propose two orthogonal techniques to reduce the negative performance impact of aggressive prefetching on multicore systems that employ an STT-RAM based LLC. First, basic priority assignment ranks the different types of LLC access requests by their criticality and services them in priority order. Second, priority boosting differentiates requests by application and prioritizes the relatively few requests from applications that access the LLC non-intensively, since these applications usually suffer the most severe performance degradation in multicore systems. Combining these two prioritization policies alleviates the negative effects of aggressive prefetching. Our results show that these techniques achieve an 8.3% average application speedup over a baseline prefetch-only design without prioritization.
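
As an illustration only, the two policies can be sketched as a software model of an LLC request queue. The abstract does not give the actual criticality ranking, the intensity metric, or how the two policies are combined, so everything named here is an assumption: the request kinds and their order in `BASE_PRIORITY`, the `threshold` on per-application access counts, and the choice to serve every request from a non-intensive application ahead of all others.

```python
import heapq
from itertools import count

# Assumed criticality ranking (lower value is served first).
# The abstract does not state the real order; this one is hypothetical.
BASE_PRIORITY = {"demand_read": 0, "writeback": 1, "prefetch": 2}


class PrioritizedLLCQueue:
    """Toy model of an LLC request queue with two prioritization policies."""

    def __init__(self, access_counts, threshold):
        # Priority boosting (assumed form): applications whose LLC access
        # count in the last interval is below `threshold` are treated as
        # non-intensive, and their requests are boosted.
        self._boosted = {app for app, n in access_counts.items() if n < threshold}
        self._heap = []
        self._seq = count()  # FIFO tie-breaker among equal priorities

    def push(self, app, kind):
        boost_rank = 0 if app in self._boosted else 1
        # Sort key: boosted applications first, then basic priority by
        # request kind, then arrival order.
        key = (boost_rank, BASE_PRIORITY[kind], next(self._seq))
        heapq.heappush(self._heap, (*key, app, kind))

    def pop(self):
        _, _, _, app, kind = heapq.heappop(self._heap)
        return app, kind

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    # "A" is LLC-intensive, "B" is not (hypothetical access counts).
    q = PrioritizedLLCQueue(access_counts={"A": 100, "B": 3}, threshold=10)
    q.push("A", "prefetch")
    q.push("A", "demand_read")
    q.push("B", "prefetch")
    q.push("A", "writeback")
    while len(q):
        print(q.pop())
```

In this sketch the single request from the non-intensive application B is served first, and A's requests then drain in basic-priority order with its prefetch last. A hardware design could instead boost by a fixed number of priority levels; the strict ordering here is just the simplest choice to model.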
