Hardware Support for Flexible Distributed Shared Memory

Workstation-based parallel systems are attractive due to their low cost and competitive uniprocessor performance. However, supporting a cache-coherent global address space on these systems involves significant overheads. We examine two approaches to coping with these overheads. First, DSM-specific hardware can be added to the off-the-shelf component base to reduce overheads. Second, application-specific coherence protocols can avoid some overheads by exploiting programmer (or compiler) knowledge of an application's communication patterns. To explore the interaction between these approaches, we simulated four designs that add DSM acceleration hardware to a collection of off-the-shelf workstation nodes. Three of the designs support user-level software coherence protocols, enabling application-specific protocol optimizations. To verify the feasibility of our hardware approach, we constructed a prototype of the simplest design. Measured speedups from the prototype match simulation results closely. We find that, even with aggressive DSM hardware support, custom protocols can provide significant speedups for some applications. In addition, the custom protocols are generally effective at reducing the impact of other overheads, including those due to less aggressive hardware support and larger network latencies. However, for three of our benchmarks, the additional hardware acceleration provided by our most aggressive design avoids the need to develop more efficient custom protocols.
