Distributed prefetch-buffer/cache design for high performance memory systems

Microprocessor execution speeds are improving at a rate of 50%-80% per year, while DRAM access times are improving at a much lower rate of 5%-10% per year. Computer systems are rapidly approaching the point at which overall system performance is determined not by CPU speed but by memory-system speed. We present a high-performance memory system architecture that overcomes the growing speed disparity between high-performance microprocessors and current-generation DRAMs. A novel prediction-and-prefetching technique is combined with a distributed cache architecture to build a high-performance memory system. We use a table-based prediction scheme with a prediction cache to prefetch data from the on-chip DRAM array into an on-chip SRAM prefetch buffer. By prefetching data, we are able to hide the large latency associated with DRAM access and cycle times. Our experiments show that with a small (32 KB) prediction cache we obtain an effective main-memory access time close to the access time of much larger secondary caches.
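The table-based prediction scheme described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual design: a prediction cache records, for each address, the address that followed it last time, and the predicted successor is prefetched from the slow DRAM array into a small SRAM-style buffer. All table and buffer sizes here are illustrative.

```python
from collections import OrderedDict

class PredictivePrefetcher:
    """Sketch of a table-based next-address predictor with an LRU
    prediction cache and an LRU prefetch buffer (hypothetical sizes)."""

    def __init__(self, table_entries=4096, buffer_lines=64):
        self.table = OrderedDict()    # prediction cache: addr -> last observed successor
        self.table_entries = table_entries
        self.buffer = OrderedDict()   # SRAM prefetch buffer (set of prefetched lines)
        self.buffer_lines = buffer_lines
        self.last_addr = None
        self.hits = self.misses = 0

    def access(self, addr):
        # Serve the access from the prefetch buffer when possible;
        # otherwise pay the full DRAM latency.
        if addr in self.buffer:
            self.buffer.move_to_end(addr)
            self.hits += 1
        else:
            self.misses += 1
        # Learn the observed successor of the previous address.
        if self.last_addr is not None:
            self.table[self.last_addr] = addr
            self.table.move_to_end(self.last_addr)
            if len(self.table) > self.table_entries:
                self.table.popitem(last=False)  # evict LRU prediction
        # Prefetch the predicted next line into the buffer.
        pred = self.table.get(addr)
        if pred is not None:
            self.buffer[pred] = True
            self.buffer.move_to_end(pred)
            if len(self.buffer) > self.buffer_lines:
                self.buffer.popitem(last=False)  # evict LRU line
        self.last_addr = addr

pf = PredictivePrefetcher()
for a in [0, 8, 16, 24] * 3:   # repeating access pattern
    pf.access(a)
print(pf.hits, pf.misses)       # prints "7 5": misses on the first pass,
                                # hits once the table has learned the loop
```

After one pass over the repeating pattern the prediction cache has learned every successor, so every subsequent access hits in the prefetch buffer; this is the mechanism by which prefetching hides DRAM latency for predictable reference streams.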