Data forwarding in scalable shared-memory multiprocessors

Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its own cache, it sends a copy of the datum to the caches of the processors that the compiler has identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that, on average, slightly optimistic support for forwarding speeds up five applications by 50% for large caches and 30% for small caches. For large caches, most read sharing misses can be eliminated, while for small caches, forwarding rarely increases the number of conflict misses. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.
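The forwarding mechanism described above can be illustrated with a toy model. The sketch below is not the paper's simulator or compiler algorithm. It assumes infinite caches, a write-through memory, and a write-invalidate baseline, and the names `Cache`, `write`, and `run` are hypothetical. It only shows why pushing a produced datum into the consumers' caches removes their read sharing misses.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Toy infinite cache: maps address -> value and counts read misses."""
    lines: dict = field(default_factory=dict)
    misses: int = 0

    def read(self, addr, memory):
        if addr not in self.lines:
            # Miss: fetch from memory (the long-latency access).
            self.misses += 1
            self.lines[addr] = memory[addr]
        return self.lines[addr]


def write(producer, addr, value, caches, memory, consumers=()):
    """Producer writes a datum; optionally forwards it to consumer caches."""
    # Assumed baseline: invalidate every other cached copy.
    for pid, cache in enumerate(caches):
        if pid != producer:
            cache.lines.pop(addr, None)
    caches[producer].lines[addr] = value
    memory[addr] = value  # write-through, for simplicity
    # Forwarding: push a copy to the (compiler-identified) consumers.
    for pid in consumers:
        caches[pid].lines[addr] = value


def run(forward):
    """Processor 0 produces a value three times; processors 1 and 2 read it."""
    memory = {0: 0}
    caches = [Cache() for _ in range(4)]
    consumers = (1, 2) if forward else ()
    for step in range(1, 4):
        write(0, 0, step, caches, memory, consumers)
        for pid in (1, 2):
            assert caches[pid].read(0, memory) == step
    return sum(cache.misses for cache in caches)


if __name__ == "__main__":
    print("read misses without forwarding:", run(forward=False))
    print("read misses with forwarding:   ", run(forward=True))
```

In this toy setting every consumer read misses without forwarding and hits with it. The model ignores finite cache capacity, so it cannot show the conflict-miss effects that the abstract discusses for small caches.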
