Improving Cache Performance by Exploiting Read-Write Disparity

Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take into account this disparity between read-write criticality. The key contribution of this paper is the new idea of distinguishing between lines that are reused by reads versus those that are reused only by writes to focus cache management policies on the more critical read lines. We propose a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance over prior cache management mechanisms. For a single-core system, RWP provides 5% average speedup across the entire SPEC CPU2006 suite, and 14% average speedup for cachesensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP’s state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.

[1]  Dennis Shasha,et al.  2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm , 1994, VLDB.

[2]  Gary S. Tyson,et al.  A modified approach to data cache management , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[3]  Babak Falsafi,et al.  Selective, accurate, and timely self-invalidation using last-touch prediction , 2000, ISCA '00.

[4]  Gary S. Tyson,et al.  Eager writeback-a technique for improving bandwidth utilization , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[5]  Babak Falsafi,et al.  Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[6]  Michel Dubois,et al.  Cost-sensitive cache replacement algorithms , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[7]  Brad Calder,et al.  Using SimPoint for accurate and efficient simulation , 2003, SIGMETRICS '03.

[8]  Nimrod Megiddo,et al.  ARC: A Self-Tuning, Low Overhead Replacement Cache , 2003, FAST.

[9]  Gurindar S. Sohi,et al.  Characterization of Problem Stores , 2004, IEEE Computer Architecture Letters.

[10]  Dharmendra S. Modha,et al.  CAR: Clock with Adaptive Replacement , 2004, FAST.

[11]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[12]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[13]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[14]  André Seznec,et al.  Exploiting Single-Usage for Effective Memory Management , 2007, Asia-Pacific Computer Systems Architecture Conference.

[15]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[16]  Stefanos Kaxiras,et al.  Cache replacement based on reuse-distance prediction , 2007, 2007 25th International Conference on Computer Design.

[17]  Yan Solihin,et al.  Counter-Based Cache Replacement and Bypassing Algorithms , 2008, IEEE Transactions on Computers.

[18]  Gabriel H. Loh,et al.  PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[19]  Mainak Chaudhuri,et al.  Pseudo-LIFO: The foundation of a new family of replacement policies for last-level caches , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[20]  Jun Yang,et al.  A durable and energy efficient main memory using phase change memory technology , 2009, ISCA '09.

[21]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[22]  Vijayalakshmi Srinivasan,et al.  Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[23]  Gabriel H. Loh,et al.  Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[24]  Moinuddin K. Qureshi,et al.  Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[25]  Onur Mutlu,et al.  DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems , 2010 .

[26]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[27]  Lizy Kurian John,et al.  The virtual write queue: coordinating DRAM and last-level cache policies , 2010, ISCA.

[28]  Samira Manabi Khan,et al.  Sampling Dead Block Prediction for Last-Level Caches , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[29]  R. Govindarajan,et al.  NUcache: An efficient multicore cache organization based on Next-Use distance , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[30]  Carole-Jean Wu,et al.  SHiP: Signature-based Hit Predictor for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[31]  Zhe Wang,et al.  Improving writeback efficiency with decoupled last-write prediction , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[32]  Onur Mutlu,et al.  The evicted-address filter: A unified mechanism to address both cache pollution and thrashing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[33]  Zhe Wang,et al.  Decoupled dynamic cache segmentation , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[34]  R. Govindarajan,et al.  Probabilistic Shared Cache Management (PriSM) , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[35]  Mahmut T. Kandemir,et al.  Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).