MRU-Tour-based Replacement Algorithms for Last-Level Caches

Memory hierarchy design is a major concern in current microprocessors. Much research focuses on the Last-Level Cache (LLC), which is designed to hide the long miss penalty of accessing main memory. To reduce both capacity and conflict misses, LLCs are implemented as large memory structures with high associativity. To exploit temporal locality, LRU is the replacement algorithm usually implemented in caches. However, for a highly associative cache, its implementation is costly in terms of area and power consumption. Moreover, LRU is not well suited to the LLC: because this cache level does not see all memory accesses, it cannot cope with temporal locality. In addition, blocks must descend to the LRU position of the stack before eviction, even when they are no longer useful. In this paper, we show that most blocks are not referenced again once they leave the MRU position. Moreover, the probability of being referenced again does not depend on the block's location in the LRU stack. Based on these observations, we define the number of MRU-Tours (MRUTs) of a block as the number of times that the block occupies the MRU position while it is stored in the cache, and propose the MRUT replacement algorithm, which selects the block to be replaced from among the blocks that show only one MRUT. Variations of this algorithm are also proposed to exploit both MRUT behavior and recency information. Experimental results show that, compared to LRU, the proposal reduces MPKI by up to 22%, while IPC is improved by 48%.
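
The MRUT definition and victim selection described above can be illustrated with a minimal Python sketch of a single cache set. This is not the paper's implementation: the class name `MRUTSet`, the exclusion of the current MRU block from the candidate list, the choice of the least recently used candidate as the victim, and the fallback to plain LRU when no block has a single MRUT are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MRUTSet:
    """One cache set with an LRU stack and a per-block MRU-Tour counter (illustrative)."""
    ways: int
    stack: list = field(default_factory=list)  # index 0 = MRU, last = LRU
    mrut: dict = field(default_factory=dict)   # tag -> number of MRU-Tours

    def access(self, tag):
        """Return True on a hit, False on a miss."""
        if tag in self.mrut:
            # A hit on a block that is not at MRU starts a new MRU-Tour.
            if self.stack[0] != tag:
                self.stack.remove(tag)
                self.stack.insert(0, tag)
                self.mrut[tag] += 1
            return True
        if len(self.stack) == self.ways:
            self._evict()
        # A newly fetched block enters at MRU: its first MRU-Tour.
        self.stack.insert(0, tag)
        self.mrut[tag] = 1
        return False

    def _evict(self):
        # Candidates: blocks with exactly one MRUT.
        # Assumption: the current MRU block is never a candidate.
        candidates = [t for t in self.stack[1:] if self.mrut[t] == 1]
        # Assumption: take the least recently used candidate,
        # falling back to plain LRU when there is none.
        victim = candidates[-1] if candidates else self.stack[-1]
        self.stack.remove(victim)
        del self.mrut[victim]
        return victim
```

With three ways and the access sequence A, B, A, C, D, E, block A has two MRUTs by the time it reaches the LRU position, so this sketch evicts C (one MRUT) on the miss for E, whereas plain LRU would evict A.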
