Using the compiler to improve cache replacement decisions

Memory performance is increasingly determining microprocessor performance and technology trends are exacerbating this problem. Most architectures use set-associative caches with LRU replacement policies to combine fast access with relatively low miss rates. To improve replacement decisions in set-associative caches, we develop a new set of compiler algorithms that predict which data will and will not be reused and provide these hints to the architecture. We prove that the hints either match or improve hit rates over LRU. We describe a practical one-bit cache-line tag implementation of our algorithm, called evict-me. On a cache replacement, the architecture will replace a line for which the evict-me bit is set, or if none is set, it will use the LRU bits. We implement our compiler analysis and its output in the Scale compiler. On a variety of scientific programs, using the evict-me algorithm in both the level 1 and 2 caches improves simulated cycle times by up to 34% over the LRU policy by increasing hit rates. In addition, a combination of simple hardware prefetching and evict-me works together to further improve performance.

[1]  D. Burger,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[2]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[3]  Yannis Smaragdakis,et al.  EELRU: simple and effective adaptive page replacement , 1999, SIGMETRICS '99.

[4]  Gary S. Tyson,et al.  Utilizing reuse information in data cache management , 1998, ICS '98.

[5]  Vikas Agarwal,et al.  Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[6]  Sarita V. Adve,et al.  RSIM Reference Manual: Version 1.0 , 1997 .

[7]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[8]  Anant Agarwal,et al.  Column-associative caches: a technique for reducing the miss rate of direct-mapped caches , 1993, ISCA '93.

[9]  Olivier Temam,et al.  An Algorithm for Optimally Exploiting Spatial and Temporal Locality in Upper Memory Levels , 1999, IEEE Trans. Computers.

[10]  A. L. Rosenberg,et al.  Improving Replacement Decisions in Set-Associative Caches TITLE2: , 2001 .

[11]  Wei-Fen Lin,et al.  Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[12]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor architecture , 1998, Proceedings International Conference on Computer Design. VLSI in Computers and Processors (Cat. No.98CB36273).

[13]  Mahmut T. Kandemir,et al.  A matrix-based approach to the global locality optimization problem , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).

[14]  Olivier Temam,et al.  Load Scheduling with Profile Information , 2000, Euro-Par.

[15]  Ken Kennedy,et al.  Practical dependence testing , 1991, PLDI '91.

[16]  Olivier Temam,et al.  Quantifying loop nest locality using SPEC'95 and the perfect benchmarks , 1999, TOCS.

[17]  Santosh G. Abraham,et al.  Efficient simulation of caches under optimal replacement with applications to miss characterization , 1993, SIGMETRICS '93.

[18]  Sharad Malik,et al.  Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.

[19]  Carole Dulong,et al.  The IA-64 Architecture at Work , 1998, Computer.

[20]  Wen-mei W. Hwu,et al.  Run-time spatial locality detection and optimization , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[21]  Walid Abu-Sufah,et al.  Improving the performance of virtual memory computers. , 1979 .

[22]  Sally A. McKee,et al.  Smarter Memory: Improving Bandwidth for Streamed References , 1998, Computer.

[23]  Lixin Zhang URSIM Reference Manual , 1999 .

[24]  Arnold L. Rosenberg,et al.  Improving Replacement Decisions in Set-Associative Caches , 2001 .

[25]  Ken Kennedy,et al.  An Implementation of Interprocedural Bounded Regular Section Analysis , 1991, IEEE Trans. Parallel Distributed Syst..

[26]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[27]  Jean-Loup Baer,et al.  Modified LRU policies for improving second-level cache behavior , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[28]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[29]  Laszlo A. Belady,et al.  A Study of Replacement Algorithms for Virtual-Storage Computer , 1966, IBM Syst. J..

[30]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[31]  Mark D. Hill,et al.  A case for direct-mapped caches , 1988, Computer.

[32]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[33]  Babak Falsafi,et al.  Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[34]  William Pugh,et al.  A practical algorithm for exact array dependence analysis , 1992, CACM.