An architecture for software-controlled data prefetching

Alexander C. Klaiber, Henry M. Levy
University of Washington, Seattle, WA 98195

This paper describes an architecture and related compiler support for software-controlled data prefetching, a technique for hiding memory latency in high-performance processors. At compile time, the compiler inserts FETCH instructions into the instruction stream, based on anticipated data references and detailed information about the memory system. At run time, a separate functional unit in the CPU, the fetch unit, interprets these instructions and initiates the appropriate memory reads. Prefetched data is kept in a small, fully-associative cache, called the fetchbuffer, to reduce contention with the conventional direct-mapped cache. We also introduce a prewriteback technique that can reduce the impact of stalls due to replacement writebacks in the cache. A detailed hardware model is presented, and the required compiler support is developed. Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops, at only a slight cost in total memory traffic.
