Decoupled Hardware Support for Distributed Shared Memory

This paper investigates hardware support for fine-grain distributed shared memory (DSM) in networks of workstations. To reduce design time and implementation cost relative to dedicated DSM systems, we decouple the functional hardware components of DSM support, allowing greater use of off-the-shelf devices.

We present two decoupled systems, Typhoon-0 and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and network interface; a custom access control device is the only DSM-specific hardware. To demonstrate the feasibility and simplicity of this access control device, we designed and built an FPGA-based version in under one year. Typhoon-1 also uses an off-the-shelf protocol processor, but integrates the network interface and access control devices for higher performance.

We compare the performance of the two decoupled systems with two integrated systems via simulation. For six benchmarks on 32 nodes, Typhoon-0 ranges from 30% to 309% slower than the best integrated system, while Typhoon-1 ranges from 13% to 132% slower. Four of the six benchmarks achieve speedups of 12 to 18 on Typhoon-0 and 15 to 26 on Typhoon-1, compared with 19 to 35 on the best integrated system. Two benchmarks are hampered by high communication overheads, but selectively replacing shared-memory operations with message passing provides speedups of at least 16 on both decoupled systems. These speedups indicate that decoupled designs can potentially provide a cost-effective alternative to complex high-end DSM systems.
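The "percent slower" figures above can be related to execution time and speedup with a small amount of arithmetic. The sketch below is illustrative only. It assumes that "P% slower" means execution time is (1 + P/100) times that of the best integrated system, and that both speedups are measured against the same uniprocessor run; the abstract states neither explicitly. The function names and the baseline speedup of 20.0 are hypothetical and are not taken from the paper.

```python
def relative_time(percent_slower: float) -> float:
    """Execution time relative to the baseline system.

    Assumption: "P% slower" means time = (1 + P/100) * baseline time.
    """
    return 1.0 + percent_slower / 100.0


def implied_speedup(baseline_speedup: float, percent_slower: float) -> float:
    """Speedup of the slower system, assuming both speedups are measured
    against the same uniprocessor execution time."""
    return baseline_speedup / relative_time(percent_slower)


if __name__ == "__main__":
    # Slowdown ranges quoted in the abstract (percent slower than the
    # best integrated system).
    for name, low, high in [("Typhoon-0", 30, 309), ("Typhoon-1", 13, 132)]:
        print(f"{name}: {relative_time(low):.2f}x to "
              f"{relative_time(high):.2f}x the execution time")

    # Hypothetical baseline speedup, chosen only to illustrate the formula.
    print(f"{implied_speedup(20.0, 30):.1f}")
```

Under this reading, the four ranges correspond to execution-time ratios of 1.30 to 4.09 for Typhoon-0 and 1.13 to 2.32 for Typhoon-1. The per-benchmark pairing of slowdowns and speedups is not given in the abstract, so the sketch does not attempt to reproduce it.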
