NUcache: An efficient multicore cache organization based on Next-Use distance

The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC — Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.

[1]  Matthias Hauswirth,et al.  Time Interpolation: So Many Metrics, So Few Registers , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[2]  Weng-Fai Wong,et al.  Static identification of delinquent loads , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[3]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[4]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[5]  Brian Rogers,et al.  Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[6]  Stijn Eyerman,et al.  System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[7]  Stefanos Kaxiras,et al.  Instruction-based reuse-distance prediction for effective cache management , 2009, 2009 International Symposium on Systems, Architectures, Modeling, and Simulation.

[8]  Dam Sunwoo,et al.  FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators , 2007, MICRO.

[9]  Gabriel H. Loh,et al.  PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[10]  José F. Martínez,et al.  Scavenger: A New Last Level Cache Architecture with Global Block Priority , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[11]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[12]  Weifeng Zhang,et al.  Accelerating and Adapting Precomputation Threads for Effcient Prefetching , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[13]  Nancy Warter-Perez,et al.  Modulo scheduling with multiple initiation intervals , 1995, MICRO 1995.

[14]  Mainak Chaudhuri,et al.  Pseudo-LIFO: The foundation of a new family of replacement policies for last-level caches , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[15]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[16]  Yan Solihin,et al.  An analytical model for cache replacement policy performance , 2006, SIGMETRICS '06/Performance '06.

[17]  Ronald G. Dreslinski,et al.  The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[18]  Gary S. Tyson,et al.  A modified approach to data cache management , 1995, MICRO 1995.

[19]  Jean-Loup Baer,et al.  Modified LRU policies for improving second-level cache behavior , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[20]  Yan Solihin,et al.  A Framework for Providing Quality of Service in Chip Multi-Processors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[21]  Yannis Smaragdakis,et al.  Adaptive Caches: Effective Shaping of Cache Behavior to Workloads , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[22]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[23]  Yale N. Patt,et al.  The V-Way cache: demand-based associativity via global replacement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[24]  Guilherme Ottoni,et al.  Global Multi-Threaded Instruction Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[25]  Aamer Jaleel,et al.  Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[26]  Stefanos Kaxiras,et al.  Cache replacement based on reuse-distance prediction , 2007, 2007 25th International Conference on Computer Design.

[27]  Jaehyuk Huh,et al.  Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[28]  R. Govindarajan,et al.  Emulating Optimal Replacement with a Shepherd Cache , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).