Effectiveness of Dynamic Prefetching in Multiple-Writer Distributed Virtual Shared-Memory Systems

We consider a network of workstations (NOW) organization consisting of bus-based multiprocessors connected by a high-latency, high-bandwidth interconnect such as ATM, on which a shared-memory programming model is imposed using a multiple-writer distributed virtual shared-memory system. The latency of bringing data into local memory severely limits the performance of such systems. To make access latencies tolerable, we propose a novel prefetch approach and show how it can be integrated into the software-based coherence layer of a multiple-writer protocol. The approach uses the access history of each page to decide which pages to prefetch. Based on detailed architectural simulations of seven scientific applications, we find that our prefetch algorithm removes a vast majority of the remote operations, which improves the performance of all applications. We also find that the bandwidth provided by ATM switches available today is sufficient to accommodate prefetching. However, the protocol processing overhead of available ATM interfaces limits the gain of the prefetching algorithms.
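To illustrate what history-guided page prefetching could look like, the following is a minimal sketch in Python. It is not the algorithm evaluated in the paper, whose details the abstract does not give. It assumes a simple successor-based predictor: for each page, it counts which page faulted next in the past, and on a new fault it proposes the most frequent successors as prefetch candidates. The class name `HistoryPrefetcher`, the method `on_fault`, and the `degree` parameter are all hypothetical.

```python
from collections import Counter, defaultdict


class HistoryPrefetcher:
    """Hypothetical per-page history predictor (illustrative assumption only)."""

    def __init__(self, degree=2):
        self.degree = degree                    # max pages proposed per fault
        self.successors = defaultdict(Counter)  # page -> counts of the page that faulted next
        self.last_fault = None                  # most recent faulting page, if any

    def on_fault(self, page):
        """Record a remote fault on `page`; return the pages to prefetch."""
        # Update the history of the previously faulting page.
        if self.last_fault is not None and self.last_fault != page:
            self.successors[self.last_fault][page] += 1
        self.last_fault = page
        # Propose the pages that most often followed `page` in the past.
        return [p for p, _ in self.successors[page].most_common(self.degree)]


prefetcher = HistoryPrefetcher(degree=1)
for page in [1, 2, 3, 1, 2, 3]:
    print(page, prefetcher.on_fault(page))
```

On a repeating access pattern the predictor proposes nothing during the first pass and starts proposing the next page once the pattern recurs. A real coherence layer would also have to weigh each prefetch against the protocol processing overhead that the abstract identifies as the limiting factor.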
