Sequential Hardware Prefetching in Shared-Memory Multiprocessors

To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of pre-fetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25% respectively. >

[1]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[2]  Alan Jay Smith,et al.  Sequential Program Prefetching in Memory Hierarchies , 1978, Computer.

[3]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[4]  Pen-Chung Yew,et al.  : Data Prefetching In Shared Memory Multiprocessors , 1987, ICPP.

[5]  James H. Patterson,et al.  Portable Programs for Parallel Processors , 1987 .

[6]  Randy H. Katz,et al.  The effect of sharing on the cache and bus performance of parallel programs , 1989, ASPLOS III.

[7]  M. Hill,et al.  Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[8]  Per Stenström,et al.  A Survey of Cache Coherence Schemes for Multiprocessors , 1990, Computer.

[9]  Michel Dubois,et al.  Memory Access Dependencies in Shared-Memory Multiprocessors , 1990, IEEE Trans. Software Eng..

[10]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[11]  HennessyJohn,et al.  Performance evaluation of memory consistency models for shared-memory multiprocessors , 1991 .

[12]  Anoop Gupta,et al.  Performance evaluation of memory consistency models for shared-memory multiprocessors , 1991, ASPLOS IV.

[13]  Ken Kennedy,et al.  Software prefetching , 1991, ASPLOS IV.

[14]  Janak H. Patel,et al.  Data prefetching in multiprocessor vector cache memories , 1991, ISCA '91.

[15]  Lars Lundberg,et al.  A Lockup-Free Multiprocessor Cache Design , 1991, ICPP.

[16]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[17]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[18]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[19]  Anoop Gupta,et al.  Cache Invalidation Patterns in Shared-Memory Multiprocessors , 1992, IEEE Trans. Computers.

[20]  Anoop Gupta,et al.  The Stanford Dash multiprocessor , 1992, Computer.

[21]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[22]  Erik Hagersten,et al.  DDM - A Cache-Only Memory Architecture , 1992, Computer.

[23]  Per Stenström,et al.  The Cachemire Test Bench A Flexible And Effective Approach For Simulation Of Multiprocessors , 1993, [1993] Proceedings 26th Annual Simulation Symposium.

[24]  Michel Dubois,et al.  Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors , 1993, 1993 International Conference on Parallel Processing - ICPP'93.

[25]  Mats Brorsson,et al.  An adaptive cache coherence protocol optimized for migratory sharing , 1993, ISCA '93.

[26]  Livio Ricciulli,et al.  The detection and elimination of useless misses in multiprocessors , 1993, ISCA '93.

[27]  Jean-Loup Baer,et al.  A performance study of software and hardware data prefetching schemes , 1994, ISCA '94.

[28]  Michel Dubois,et al.  Combined performance gains of simple cache protocol extensions , 1994, ISCA '94.

[29]  Per Stenström,et al.  Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.