Run-time spatial locality detection and optimization

As the disparity between processor and main memory performance grows, the number of execution cycles spent waiting for memory accesses to complete also increases. As a result, latency hiding techniques are critical for improving application performance on future processors. We present a microarchitecture scheme that detects and adapts to varying spatial locality, dynamically adjusting the amount of data fetched on a cache miss. The spatial locality detection table, introduced in this paper, facilitates the detection of spatial locality across adjacent cached blocks. Results from detailed simulations of several integer programs show significant speedups. The improvements come from two sources: the reduction of conflict and capacity misses by using small blocks and small fetch sizes when spatial locality is absent, and the prefetching effect of large fetch sizes when spatial locality exists.
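The idea of varying the fetch size with observed spatial locality can be illustrated with a small software model. The sketch below is not the hardware design described in the paper: the class name `AdaptiveFetchCache`, the per-region `fetch_hint` dictionary, the doubling and halving policy, and all sizes are assumptions chosen only for illustration. It models a fully associative LRU cache of small blocks and, on each miss, treats a resident adjacent block as evidence of spatial locality.

```python
from collections import OrderedDict


class AdaptiveFetchCache:
    """Toy fully associative LRU cache of small blocks (illustrative only).

    Hypothetical policy: on a miss, if a neighbouring block is resident,
    double the fetch size remembered for that region; otherwise halve it.
    """

    def __init__(self, capacity_blocks=64, block_bytes=8, max_fetch_blocks=8):
        self.capacity = capacity_blocks
        self.block_bytes = block_bytes
        self.max_fetch = max_fetch_blocks
        self.blocks = OrderedDict()  # resident block numbers, oldest first
        self.fetch_hint = {}         # region number -> fetch size in blocks
        self.misses = 0
        self.blocks_fetched = 0      # counts every block brought in on a miss

    def _insert(self, block):
        # Insert (or refresh) a block and evict least recently used blocks.
        self.blocks[block] = None
        self.blocks.move_to_end(block)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def access(self, addr):
        """Return True on a hit, False on a miss."""
        block = addr // self.block_bytes
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True

        self.misses += 1
        region = block // self.max_fetch
        size = self.fetch_hint.get(region, 1)

        # Spatial locality evidence: an adjacent block is already cached.
        if (block - 1) in self.blocks or (block + 1) in self.blocks:
            size = min(size * 2, self.max_fetch)
        else:
            size = max(size // 2, 1)
        self.fetch_hint[region] = size

        # Fetch an aligned group of `size` blocks containing the missed block.
        start = block - block % size
        for b in range(start, start + size):
            self._insert(b)
        self.blocks_fetched += size
        self.blocks.move_to_end(block)  # the requested block is most recent
        return False


if __name__ == "__main__":
    adaptive = AdaptiveFetchCache()
    fixed_small = AdaptiveFetchCache(max_fetch_blocks=1)
    for addr in range(0, 1024, 8):  # sequential scan, one access per block
        adaptive.access(addr)
        fixed_small.access(addr)
    print("adaptive misses:", adaptive.misses)
    print("fixed small-fetch misses:", fixed_small.misses)
```

In this model a sequential scan quickly grows the fetch size and benefits from the prefetching effect, while a widely strided access pattern never finds a resident neighbour, so the fetch size stays at one block and no cache space is spent on unused data.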
