A performance study of software and hardware data prefetching schemes

Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be hardware-based, software-directed, or a combination of both. Hardware-based prefetching, which requires a support unit connected to the cache, can handle prefetches dynamically at run time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes.

In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching, whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with the least overhead.
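To make the claims about linear array references and extra memory traffic concrete, the following is a minimal sketch of a toy stride prefetcher. It is not the hardware scheme evaluated in the paper: the function name `count_misses`, the unbounded cache, the 32-byte block size, and the one-step-ahead stride prediction are all assumptions made for illustration, and the model ignores latency, so it says nothing about whether a prefetch arrives in time.

```python
def count_misses(addresses, block_size=32, prefetch=False):
    """Return (demand_misses, blocks_fetched) for a stream of byte addresses.

    Hypothetical toy model: the cache is unbounded (no evictions) and a
    prefetched block is available immediately. With prefetch=True, after
    each access the block at (address + last observed stride) is fetched.
    """
    cached = set()   # block numbers currently held in the cache
    misses = 0       # demand accesses whose block was absent
    fetches = 0      # all blocks brought in (demand + prefetch), i.e. traffic
    prev = None      # previous address, used to compute the stride
    for addr in addresses:
        block = addr // block_size
        if block not in cached:
            misses += 1
            fetches += 1
            cached.add(block)
        if prefetch and prev is not None:
            stride = addr - prev
            target = (addr + stride) // block_size
            if stride != 0 and target >= 0 and target not in cached:
                cached.add(target)
                fetches += 1
        prev = addr
    return misses, fetches


# Linear walk over 64 eight-byte elements (16 blocks of 32 bytes).
linear = [8 * i for i in range(64)]
baseline = count_misses(linear)                   # (16, 16)
with_prefetch = count_misses(linear, prefetch=True)  # (1, 17)
```

In this toy run the prefetcher removes all but the first demand miss, but it also fetches one block past the end of the array, which is a small instance of the extra memory traffic that the abstract attributes to the hardware scheme.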

[1] Alexander V. Veidenbaum et al. Compiler-directed data prefetching in multiprocessors with memory hierarchies, 1990, ICS '90.

[2] Dean M. Tullsen et al. Limitations of cache prefetching on a bus-based multiprocessor, 1993, ISCA '93.

[3] Scott A. Mahlke et al. Data access microarchitectures for superscalar processors with compiler-assisted data prefetching, 1991, MICRO 24.

[4] Norman P. Jouppi et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, 1990, Proceedings of the 17th Annual International Symposium on Computer Architecture.

[5] David Kroft et al. Lockup-free instruction fetch/prefetch cache organization, 1981, ISCA '81.

[6] Henry M. Levy et al. An Architecture for Software-Controlled Data Prefetching, 1991, ISCA.

[7] Jean-Loup Baer et al. An effective on-chip preloading scheme to reduce data access penalty, 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[8] Tien-Fu Chen et al. Data prefetching for high-performance processors, 1993.

[9] Paul Feautrier et al. A New Solution to Coherence Problems in Multicache Systems, 1978, IEEE Transactions on Computers.

[10] Monica S. Lam et al. A data locality optimizing algorithm, 1991, PLDI '91.

[11] Ken Kennedy et al. Software methods for improvement of cache performance on supercomputer applications, 1989.

[12] Anoop Gupta et al. Design and evaluation of a compiler algorithm for prefetching, 1992, ASPLOS V.

[13] Susan J. Eggers et al. Eliminating False Sharing, 1991, ICPP.

[14] Janak H. Patel et al. Stride directed prefetching in scalar processors, 1992, MICRO.

[15] Anoop Gupta et al. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors, 1991, J. Parallel Distributed Comput.

[16] J.W.C. Fu et al. Data prefetching in multiprocessor vector cache memories, 1991, Proceedings of the 18th Annual International Symposium on Computer Architecture.

[17] Anoop Gupta et al. SPLASH: Stanford parallel applications for shared-memory, 1992, CARN.

[18] Janak H. Patel et al. Data prefetching in multiprocessor vector cache memories, 1991, ISCA '91.