A Preliminary Evaluation of Cache-Miss-Initiated Prefetching Techniques in Scalable Multiprocessors

Abstract: Prefetching is an important technique for reducing the average latency of memory accesses in scalable cache-coherent multiprocessors. Aggressive prefetching can significantly reduce the number of cache misses, but may introduce bursty network and memory traffic, increase data sharing, and cause cache pollution. Given that we anticipate enormous increases in both network bandwidth and latency, we examine whether aggressive prefetching triggered by a miss (cache-miss-initiated prefetching) can substantially improve the running time of parallel programs. Using execution-driven simulation of parallel programs on a scalable cache-coherent machine, we study the performance of three cache-miss-initiated prefetching techniques: large cache blocks, sequential prefetching, and hybrid prefetching. Large cache blocks (which fetch multiple words within a single block) and sequential prefetching (which fetches multiple consecutive blocks) are well-known prefetching strategies. Hybrid prefetching is a novel technique combining hardware and software support for stride-directed prefetching. Our simulation results show that large cache blocks rarely provide significant performance improvements; the improvement in the miss rate is often too small (or nonexistent) to offset the corresponding increase in the miss penalty. Our results also show that sequential and hybrid prefetching perform better than prefetching via large cache blocks, and that hybrid prefetching performs at least as well as sequential prefetching.
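The hardware half of the hybrid technique is a form of stride-directed prediction: detect a repeating address delta for a given load and prefetch one stride ahead. As a rough illustration only, here is a minimal sketch in C of such a predictor, in the style of a reference prediction table. The table size, field names, and the observe_load interface are illustrative assumptions, not the paper's actual design, and the compiler-directed software half of the hybrid scheme is not shown.

```c
/* Minimal sketch of a stride-directed prefetch predictor.
 * All structure and parameter names here are illustrative assumptions,
 * not the design evaluated in the paper. */
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

#define PF_TABLE_SIZE 64   /* entries, indexed by the load's PC */

typedef struct {
    uint64_t pc;         /* PC of the load that owns this entry       */
    uint64_t last_addr;  /* last data address referenced by that load */
    int64_t  stride;     /* last observed address delta               */
    int      confirmed;  /* nonzero once the same stride repeats      */
} pf_entry;

static pf_entry table[PF_TABLE_SIZE];

/* Called on every load; returns a predicted prefetch address, or 0
 * when no confident prediction exists. */
uint64_t observe_load(uint64_t pc, uint64_t addr)
{
    pf_entry *e = &table[pc % PF_TABLE_SIZE];

    if (e->pc != pc) {              /* new load: (re)initialize entry */
        e->pc        = pc;
        e->last_addr = addr;
        e->stride    = 0;
        e->confirmed = 0;
        return 0;
    }

    int64_t delta = (int64_t)(addr - e->last_addr);
    e->confirmed  = (delta != 0 && delta == e->stride);
    e->stride     = delta;
    e->last_addr  = addr;

    /* Prefetch one stride ahead only after the stride has repeated,
     * so a single irregular access does not trigger useless traffic. */
    return e->confirmed ? addr + (uint64_t)delta : 0;
}

int main(void)
{
    /* A load at PC 0x400010 walking an array with a 64-byte stride. */
    for (uint64_t a = 0x1000; a < 0x1200; a += 64) {
        uint64_t pf = observe_load(0x400010, a);
        if (pf)
            printf("access 0x%" PRIx64 " -> prefetch 0x%" PRIx64 "\n", a, pf);
    }
    return 0;
}
```

Requiring the stride to repeat before issuing a prefetch is one plausible way to address the concern raised in the abstract: irregular reference patterns never trigger prefetches, limiting bursty traffic and cache pollution.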
