Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps

We propose and evaluate a new data prefetching technique for cache coherent multiprocessors. Prefetches are issued by a prefetch engine which is controlled by the compiler. Second-level cache misses generate cache miss traps, and start the prefetch engine in a trap handler generated by the compiler. The only instruction overhead in our approach is when a trap handler terminates after data arrives. We present the functionality of the prefetch engine and a compiler algorithm to control it. We also study emulation of the prefetch engine in software. Our techniques are evaluated on six parallel applications using a compiler which incorporates our algorithm and a simulated multiprocessor. The prefetch engines remove up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engines remove in general less stall time at a higher instruction overhead.

[1]  Per Stenström,et al.  The Cachemire Test Bench A Flexible And Effective Approach For Simulation Of Multiprocessors , 1993, [1993] Proceedings 26th Annual Simulation Symposium.

[2]  Paul Hudak,et al.  Memory coherence in shared virtual memory systems , 1986, PODC '86.

[3]  Margaret Martonosi,et al.  Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors , 1996, ISCA.

[4]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[5]  James H. Patterson,et al.  Portable Programs for Parallel Processors , 1987 .

[6]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[7]  Tien-Fu Chen,et al.  Alternative implementations of hybrid branch predictors , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[8]  Todd C. Mowry,et al.  Tolerating latency through software-controlled data prefetching , 1994 .

[9]  LiKai,et al.  Memory coherence in shared virtual memory systems , 1989 .

[10]  Todd C. Mowry,et al.  Compiler-based prefetching for recursive data structures , 1996, ASPLOS VII.

[11]  Margaret Martonosi,et al.  Informing Loads: Enabling Software to Observe and React to Memory Behavior , 1995 .

[12]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[13]  Yale N. Patt,et al.  An effective programmable prefetch engine for on-chip caches , 1995, MICRO 1995.

[14]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[15]  Michel Dubois,et al.  Sequential Hardware Prefetching in Shared-Memory Multiprocessors , 1995, IEEE Trans. Parallel Distributed Syst..

[16]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[17]  Jean-Loup Baer,et al.  A performance study of software and hardware data prefetching schemes , 1994, ISCA '94.