Deconstructing the Inefficacy of Global Cache Replacement Policies

In a conventional two-level cache hierarchy, L1 cache hits do not propagate to the L2 cache; as a result, the L2 cache observes only a “filtered” memory access stream. A frequently accessed address may hit repeatedly in the L1, but because these accesses never reach the L2, the corresponding copy in the L2 “decays” with respect to the L2's replacement policy state and may eventually be evicted. Previous studies have advocated global replacement policies, in which L1 access information propagates to the L2 so that the L2 maintains a replacement policy state consistent with the overall global memory access stream. We first attempt to reproduce previously reported results on global cache replacement policies. Despite the intuitive explanation for why a global scheme should work, our experimental results show that the performance potential of global replacement is very limited. We deconstruct the problem with reuse-distance analysis and show that a program can benefit from global replacement only under very specific reuse-distance profiles. Our experiments cover multi-core shared caches, inclusive cache hierarchies, and a wide spectrum of cache sizes and associativities; we show that global replacement fails to provide significant performance benefits in any of these scenarios.
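The filtering effect described above can be illustrated with a toy simulation. The following is a minimal sketch, not the simulator used in the study: it assumes fully associative LRU caches, a non-inclusive hierarchy in which a block is filled into both levels on an L2 miss, and a hand-built trace. The names `LRUCache`, `simulate`, and `global_replacement` are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache (an assumption made for simplicity)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # ordered from LRU (first) to MRU (last)

    def touch(self, addr):
        """Promote addr to MRU if present; return True on a hit."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def fill(self, addr):
        """Insert addr at MRU, evicting the LRU line if the cache is full."""
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[addr] = True


def simulate(trace, l1_size, l2_size, global_replacement):
    """Return the number of L2 misses for the given address trace."""
    l1, l2 = LRUCache(l1_size), LRUCache(l2_size)
    l2_misses = 0
    for addr in trace:
        if l1.touch(addr):
            # Conventional hierarchy: the L2 never sees this access.
            # Global scheme: forward the hit so the L2 updates its LRU state.
            if global_replacement:
                l2.touch(addr)
            continue
        if not l2.touch(addr):
            l2_misses += 1
            l2.fill(addr)
        l1.fill(addr)
    return l2_misses


# Contrived trace: "A" is hot in the L1 while x1..x4 stream through the L2,
# then y1 and y2 push "A" out of the L1 just before it is reused.
trace = ["A", "x1", "A", "x2", "A", "x3", "A", "x4", "y1", "y2", "A"]

if __name__ == "__main__":
    print("local  L2 misses:", simulate(trace, 2, 4, global_replacement=False))
    print("global L2 misses:", simulate(trace, 2, 4, global_replacement=True))
```

With a 2-line L1 and a 4-line L2, this trace incurs 8 L2 misses under local replacement and 7 under global replacement, because the final access to `A` finds its L2 copy still resident only when L1 hits were forwarded. The trace is deliberately built to hit that window: `A` must be evicted from the L1 after its L2 copy has decayed under the local policy but before it would also have been evicted under the global one. This is one way to picture the narrow reuse-distance profile the abstract refers to.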
