Effects of memory latencies on non-blocking processor/cache architectures

In this paper, we introduce a simple hardware mechanism that supports non-blocking loads in conjunction with lockup-free caches to hide memory latencies in high-performance processors. The cache and processor cooperate on load misses, which greatly reduces the overall complexity of the non-blocking mechanisms in both. We use detailed simulations to evaluate how effectively the architecture and a simple compiler transformation hide miss latencies of up to 200 processor cycles. For a given program, we identify a critical latency. For latencies below this critical latency, the non-blocking processor/cache architecture achieves perfect memory latency tolerance by overlapping misses with processor execution. For higher latencies, significant improvements in processor efficiency are still obtained by overlapping multiple misses with one another. A simple model illustrates this effect, and improvements are proposed based on the results.
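The critical-latency behaviour described above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the model used in the paper: the parameter `work` (independent processor cycles available per miss before the loaded value is needed) and the parameter `overlap` (number of misses in flight together) are hypothetical, as are the function names. In this sketch the critical latency is simply equal to `work`.

```python
def stall_cycles(latency, work, overlap=1):
    """Exposed stall cycles per miss in a toy model (assumption, not the paper's model).

    latency: miss latency in processor cycles
    work:    hypothetical independent cycles the processor can execute per miss
             before it needs the loaded value (plays the role of the critical latency)
    overlap: hypothetical number of misses outstanding at the same time
    """
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    # Below the critical latency the miss is fully hidden by execution.
    exposed = max(0, latency - work)
    # Above it, the exposed part is shared among the overlapped misses.
    return exposed / overlap


def efficiency(latency, work, overlap=1):
    """Fraction of cycles spent doing useful work, per miss, in the same toy model."""
    return work / (work + stall_cycles(latency, work, overlap))


if __name__ == "__main__":
    for latency in (50, 80, 200):
        for overlap in (1, 4):
            print(latency, overlap, round(efficiency(latency, 80, overlap), 3))
```

With `work=80`, any latency up to 80 cycles gives an efficiency of 1.0 in this sketch, while at 200 cycles the efficiency drops and then partly recovers as `overlap` grows, which mirrors the two regimes the abstract describes.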
