Characterization and improvement of load/store cache-based prefetching

A common mechamsm to pcrlorm hat-dware-based pl-efetching for regular accesses to arrays and chained lists is based on a Load/Store cache (LSC). An LSC associates the address of a Id/SC instruction with Its individual hchavior at every entry. WC show that the implementation cost of the LSC is rather high, and that using it is Ineff’icicnt. We aim to decrease the cost of the LSC but not its pcrl’ormancc. This may he done preventing useless instructions from hemg stored in the LSC. We propose eliminatmg those inslructions that never miss, and those that follow a sequential pilttWl1. This may be carried out by insertin, 0 a lci/ it inslruction in the 1,SC whenever it misses in the data cache (on-miss insertion), and issuing sequential prefetching simultaneously. After having analy~d the perf’ormancc of this proposal through a cycle-by-cycle simulation oveta set 01. 25 benchmarks selected from SPEC9.5, SPEC92 and Perfect Club, we conclude that an LSC of only 8 entries, which combines on-miss insertion and sequential prettching, performs hetter than a conventional LSC of 5 I2 entries. We think thal the low COSI of the proposal makes it worth being taken into account for the development of future microprocessors.

[1]  Luddy Harrison Examination of a memory access classification scheme for pointer-intensive and numeric programs , 1996, ICS '96.

[2]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[3]  Norman P. Jouppi,et al.  How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors? , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[4]  James R. Goodman,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[5]  Apostolos Dollas,et al.  Predicting and precluding problems with memory latency , 1994, IEEE Micro.

[6]  Víctor Viñals,et al.  Performance assessment of contents management in multilevel on-chip caches , 1996, Proceedings of EUROMICRO 96. 22nd Euromicro Conference. Beyond 2000: Hardware and Software Design Strategies.

[7]  Yvon Jégou,et al.  Speculative prefetching , 1993, ICS '93.

[8]  Michael J. Flynn,et al.  An area model for on-chip memories and its application , 1991 .

[9]  Jean-Loup Baer,et al.  A performance study of software and hardware data prefetching schemes , 1994, ISCA '94.

[10]  J LiljaDavid,et al.  When Caches Aren't Enough , 1997 .

[11]  David Keppel,et al.  Shade: a fast instruction-set simulator for execution profiling , 1994, SIGMETRICS.

[12]  Michel Dubois,et al.  Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors , 1993, 1993 International Conference on Parallel Processing - ICPP'93.

[13]  T. Ozawa,et al.  Cache miss heuristics and preloading techniques for general-purpose programs , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[14]  Michel Dubois,et al.  Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[15]  J.W.C. Fu,et al.  Stride Directed Prefetching In Scalar Processors , 1992, [1992] Proceedings the 25th Annual International Symposium on Microarchitecture MICRO 25.

[16]  Alan Jay Smith,et al.  Cache Memories , 1982, CSUR.

[17]  Dean M. Tullsen,et al.  Effective cache prefetching on bus-based multiprocessors , 1995, TOCS.

[18]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[19]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[20]  Per Stenström,et al.  Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[21]  David J. Lilja,et al.  When Caches Aren't Enough: Data Prefetching Techniques , 1997, Computer.

[22]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.