Cooperative caching with keep-me and evict-me

Cooperative caching seeks to improve memory system performance by using compiler locality hints to assist hardware cache decisions. In this paper, the compiler suggests which cache lines to keep or evict in set-associative caches. A compiler analysis predicts which data will and will not be reused, and annotates the corresponding memory operations with a keep-me or evict-me hint. The architecture stores these hints with each cache line and acts on them only on a cache miss. Evict-me caching prefers to evict lines marked evict-me. Keep-me caching retains keep-me lines if possible. Otherwise, the default replacement algorithm evicts the least-recently-used (LRU) line in the set. This paper introduces the keep-me hint, the associated compiler analysis, and architectural support. The keep-me architecture includes very modest ISA support, replacement algorithms, and decay mechanisms that avoid retaining keep-me lines indefinitely. Our results for our implementation of keep-me are mixed, but show that it has potential. We combine keep-me with evict-me from previous work, but find few additive benefits because of a limitation in our compiler algorithm, which applies each hint independently rather than performing a combined analysis.
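As a rough illustration of the replacement policy described above, the following Python sketch models one set of a set-associative cache whose lines carry a keep-me or evict-me hint. It is a simplified software model, not the paper's hardware design. The names `Hint`, `Line`, and `CacheSet`, the choice to refresh a line's hint on a hit, and the counter-based decay rule governed by `decay_limit` are all assumptions made for illustration; the abstract says only that decay mechanisms exist, not how they work.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Hint(Enum):
    NONE = 0
    KEEP_ME = 1
    EVICT_ME = 2


@dataclass(eq=False)  # compare lines by identity, not by field values
class Line:
    tag: int
    hint: Hint = Hint.NONE
    last_used: int = 0  # larger value means more recently used
    spared: int = 0     # hypothetical decay counter, not from the paper


class CacheSet:
    """One set of a set-associative cache with hint-aware replacement."""

    def __init__(self, ways: int, decay_limit: int = 2):
        self.ways = ways
        self.decay_limit = decay_limit  # assumed decay threshold
        self.lines: List[Line] = []
        self.clock = 0

    @staticmethod
    def _lru(candidates: List[Line]) -> Line:
        return min(candidates, key=lambda line: line.last_used)

    def _choose_victim(self) -> Line:
        # Hints are consulted only here, that is, only on a miss.
        evict_me = [l for l in self.lines if l.hint is Hint.EVICT_ME]
        if evict_me:
            victim = self._lru(evict_me)        # prefer evict-me lines
        else:
            unprotected = [l for l in self.lines if l.hint is not Hint.KEEP_ME]
            # Retain keep-me lines if possible; if every line is keep-me,
            # fall back to plain LRU over the whole set.
            victim = self._lru(unprotected or self.lines)
        # Assumed decay rule: a keep-me line that is older than the victim
        # was spared; after decay_limit spares it loses its hint.
        for l in self.lines:
            if (l is not victim and l.hint is Hint.KEEP_ME
                    and l.last_used < victim.last_used):
                l.spared += 1
                if l.spared >= self.decay_limit:
                    l.hint = Hint.NONE
        return victim

    def access(self, tag: int, hint: Hint = Hint.NONE) -> Optional[int]:
        """Access a line; return the evicted tag if a line was replaced."""
        self.clock += 1
        for line in self.lines:
            if line.tag == tag:                 # hit: update recency
                line.last_used = self.clock
                line.hint = hint                # assumption: latest hint wins
                line.spared = 0
                return None
        new_line = Line(tag, hint, self.clock)
        if len(self.lines) < self.ways:         # cold miss, free way available
            self.lines.append(new_line)
            return None
        victim = self._choose_victim()
        self.lines.remove(victim)
        self.lines.append(new_line)
        return victim.tag
```

In a two-way set, inserting line 1 with a keep-me hint and then lines 2 and 3 without hints evicts line 2, whereas plain LRU would evict line 1. With `decay_limit=2`, line 1 loses its hint after being spared twice and becomes an ordinary LRU candidate again.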
