Paper information - A Case for MLP-Aware Cache Replacement

A Case for MLP-Aware Cache Replacement

Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses: some misses occur in isolation, while others occur in parallel with other misses. Isolated misses cost more performance than parallel misses. However, traditional cache replacement is not aware of this MLP-dependent cost differential between misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement that uses a run-time technique to compute the MLP-based cost of each cache miss, and it describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR) that dynamically chooses between an MLP-aware and a traditional replacement policy, depending on which is more effective at reducing the number of memory-related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
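The abstract does not spell out the cost formula or the victim-selection rule, so the following is only a minimal sketch under stated assumptions. It assumes that each cycle contributes one unit of stall cost that is split evenly among the misses outstanding in that cycle, so an isolated miss accumulates its full latency while parallel misses share theirs. It also assumes that a victim is chosen by the smallest sum of recency and a weighted, quantized cost. The names `mlp_costs` and `pick_victim`, the tuple layout, and the default `weight=4` are illustrative choices, not details taken from the text above.

```python
def mlp_costs(misses):
    """Estimate an MLP-based cost for each miss.

    misses: dict mapping a miss name to (start_cycle, end_cycle),
            with end_cycle exclusive.
    Assumption: every cycle adds 1/N to each of the N misses that are
    outstanding in that cycle.
    """
    costs = {name: 0.0 for name in misses}
    if not misses:
        return costs
    first = min(start for start, _ in misses.values())
    last = max(end for _, end in misses.values())
    for cycle in range(first, last):
        outstanding = [n for n, (s, e) in misses.items() if s <= cycle < e]
        for name in outstanding:
            costs[name] += 1.0 / len(outstanding)
    return costs


def pick_victim(lines, weight=4):
    """Choose a victim using both recency and MLP-based cost.

    lines: list of (tag, recency, quantized_cost), where recency 0 is
           the least recently used line.
    Assumption: evict the line with the smallest
    recency + weight * quantized_cost.
    """
    return min(lines, key=lambda line: line[1] + weight * line[2])[0]


# One isolated miss (A) and two fully overlapping misses (B, C),
# all with a 100-cycle latency.
costs = mlp_costs({"A": (0, 100), "B": (200, 300), "C": (200, 300)})
print(costs)

# X is least recently used but was costly to fetch; Y is cheap to refetch.
victim = pick_victim([("X", 0, 3), ("Y", 1, 0), ("Z", 2, 1)])
print(victim)
```

In this sketch the isolated miss is charged its whole latency while each of the overlapping misses is charged half, and the victim chosen is not the least recently used line, because that line was expensive to bring in.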

Moinuddin K. Qureshi | Daniel N. Lynch | Onur Mutlu
