论文信息 - NUcache: An efficient multicore cache organization based on Next-Use distance

NUcache: An efficient multicore cache organization based on Next-Use distance

The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC — Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.

R. Govindarajan | R. Manikantan | Kaushik Rajan

[1] Matthias Hauswirth,et al. Time Interpolation: So Many Metrics, So Few Registers , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[2] Weng-Fai Wong,et al. Static identification of delinquent loads , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[3] Aamer Jaleel,et al. High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[4] Jichuan Chang,et al. Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[5] Brian Rogers,et al. Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[6] Stijn Eyerman,et al. System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[7] Stefanos Kaxiras,et al. Instruction-based reuse-distance prediction for effective cache management , 2009, 2009 International Symposium on Systems, Architectures, Modeling, and Simulation.

[8] Dam Sunwoo,et al. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators , 2007, MICRO.

[9] Gabriel H. Loh,et al. PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[10] José F. Martínez,et al. Scavenger: A New Last Level Cache Architecture with Global Block Priority , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[11] Onur Mutlu,et al. A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[12] Weifeng Zhang,et al. Accelerating and Adapting Precomputation Threads for Effcient Prefetching , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[13] Nancy Warter-Perez,et al. Modulo scheduling with multiple initiation intervals , 1995, MICRO 1995.

[14] Mainak Chaudhuri,et al. Pseudo-LIFO: The foundation of a new family of replacement policies for last-level caches , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[15] Aamer Jaleel,et al. Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[16] Yan Solihin,et al. An analytical model for cache replacement policy performance , 2006, SIGMETRICS '06/Performance '06.

[17] Ronald G. Dreslinski,et al. The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[18] Gary S. Tyson,et al. A modified approach to data cache management , 1995, MICRO 1995.

[19] Jean-Loup Baer,et al. Modified LRU policies for improving second-level cache behavior , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[20] Yan Solihin,et al. A Framework for Providing Quality of Service in Chip Multi-Processors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[21] Yannis Smaragdakis,et al. Adaptive Caches: Effective Shaping of Cache Behavior to Workloads , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[22] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[23] Yale N. Patt,et al. The V-Way cache: demand-based associativity via global replacement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[24] Guilherme Ottoni,et al. Global Multi-Threaded Instruction Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[25] Aamer Jaleel,et al. Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[26] Stefanos Kaxiras,et al. Cache replacement based on reuse-distance prediction , 2007, 2007 25th International Conference on Computer Design.

[27] Jaehyuk Huh,et al. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[28] R. Govindarajan,et al. Emulating Optimal Replacement with a Shepherd Cache , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).