Dynamic cache management in multi-core architectures through run-time adaptation

Non-Uniform Cache Access (NUCA) architectures offer a way to reduce the average latency of the last-level cache (LLC) by organizing the cache into per-core local and remote partitions. Recent research has demonstrated the benefits of cooperative cache sharing between local and remote partitions. However, ignoring the cache access patterns of concurrently executing applications that share these partitions can cause inter-partition contention, which reduces overall instruction throughput. We propose a dynamic cache management scheme for the LLC in NUCA-based architectures that reduces this inter-partition contention. Our scheme enables efficient cache sharing by adapting migration, insertion, and promotion policies to the dynamic requirements of individual applications with differing cache access behaviors, and it allows individual cores to steal cache capacity from remote partitions to achieve better resource utilization. On average, our scheme improves performance (instructions per cycle) by 28% (minimum 8.4%, maximum 75%) compared to a private LLC organization.
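The idea of adapting insertion and promotion policies to each application's observed reuse behavior can be sketched as a toy single-set cache simulation. The class name, the 10% hit-rate threshold, and the one-step promotion are illustrative assumptions for exposition, not the paper's actual mechanism:

```python
class AdaptiveCacheSet:
    """Toy set-associative cache set with per-application adaptive
    insertion and incremental promotion (illustrative sketch only)."""

    def __init__(self, ways=8):
        self.ways = ways
        self.lines = []      # index 0 = MRU position, last index = LRU position
        self.hits = {}       # per-application hit counts
        self.accesses = {}   # per-application access counts

    def access(self, app, tag):
        """Access a block; returns True on hit, False on miss."""
        self.accesses[app] = self.accesses.get(app, 0) + 1
        key = (app, tag)
        if key in self.lines:
            self.hits[app] = self.hits.get(app, 0) + 1
            # Incremental promotion: move one position toward MRU
            # instead of jumping straight to MRU on every hit.
            i = self.lines.index(key)
            if i > 0:
                self.lines[i - 1], self.lines[i] = self.lines[i], self.lines[i - 1]
            return True
        # Miss: evict the LRU block if the set is full.
        if len(self.lines) == self.ways:
            self.lines.pop()
        # Adaptive insertion: applications showing reuse insert near MRU;
        # streaming (low-reuse) applications insert at LRU to limit pollution.
        hit_rate = self.hits.get(app, 0) / self.accesses[app]
        if hit_rate >= 0.1:
            self.lines.insert(0, key)
        else:
            self.lines.append(key)
        return False
```

In this sketch a streaming application keeps replacing its own LRU-inserted blocks, so a co-running application with a small, hot working set retains its lines, which is the pollution-avoidance effect that adaptive insertion policies aim for.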
