An Intelligent Cache System with Hardware Prefetching for High Performance

We present a high performance cache structure with a hardware prefetching mechanism that enhances exploitation of spatial and temporal locality. The proposed cache, which we call a selective-mode intelligent (SMI) cache, consists of three parts: a direct-mapped cache with a small block size, a fully associative spatial buffer with a large block size, and a hardware prefetching unit. Temporal locality is exploited by selectively moving small blocks into the direct-mapped cache after monitoring their activity in the spatial buffer for a time period. Spatial locality is enhanced by intelligently prefetching a neighboring block when a spatial buffer hit occurs. The overhead of this prefetching operation is shown to be negligible. We also show that the prefetch operation is highly accurate: Over 90 percent of all prefetches generated are for blocks that are subsequently accessed. Our results show that the system enables the cache size to be reduced by a factor of four to eight relative to a conventional direct-mapped cache while maintaining similar performance. Also, the SMI cache can reduce the miss ratio by around 20 percent and the average memory access time by 10 percent, compared with a victim-buffer cache configuration.

[1]  James R. Larus,et al.  Optimally profiling and tracing programs , 1994, TOPL.

[2]  Steven Przybylski The performance impact of block sizes and fetch strategies , 1990, ISCA '90.

[3]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[4]  Edward S. Davidson,et al.  Reducing conflicts in direct-mapped caches with a temporality-based design , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[5]  Jang-Soo Lee,et al.  A new cache architecture based on temporal and spatial locality , 2000, J. Syst. Archit..

[6]  Mateo Valero,et al.  Static locality analysis for cache management , 1997, Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques.

[7]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[8]  Antonio Gonzalez,et al.  A data cache with multiple caching strategies tuned to different types of locality , 1995, International Conference on Supercomputing.

[9]  Ken Chan,et al.  PA7200: a PA-RISC processor with integrated high performance MP bus interface , 1994, Proceedings of COMPCON '94.

[10]  Richard E. Hank,et al.  An efficient architecture for loop based data preloading , 1992, MICRO 1992.

[11]  Michael J. Flynn,et al.  An area model for on-chip memories and its application , 1991 .

[12]  Tack-Don Han,et al.  A Power Efficient Cache Structure for Embedded Processors Based on the Dual Cache Structure , 2000, LCTES.

[13]  G. Albera,et al.  Power/performance advantages of victim buffer in high-performance processors , 1999, Proceedings IEEE Alessandro Volta Memorial Workshop on Low-Power Design.

[14]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[15]  Scott A. Mahlke,et al.  Data access microarchitectures for superscalar processors with compiler-assisted data prefetching , 1991, MICRO 24.