Shared memory computing on clusters with symmetric multiprocessors and system area networks

Cashmere is a software distributed shared memory (S-DSM) system designed for clusters of server-class machines. It is distinguished from most other S-DSM projects by (1) the effective use of fast user-level messaging, as provided by modern system-area networks, and (2) a “two-level” protocol structure that exploits hardware coherence within multiprocessor nodes. Fast user-level messages change the tradeoffs in coherence protocol design; they allow Cashmere to employ a relatively simple directory-based coherence protocol. Exploiting hardware coherence within SMP nodes improves overall performance when care is taken to avoid interference with inter-node software coherence.

We have implemented Cashmere on a Compaq AlphaServer/Memory Channel cluster, an architecture that provides fast user-level messages. Experiments indicate that a one-level version of the Cashmere protocol provides performance comparable to, or slightly better than, that of TreadMarks' lazy release consistency. Comparisons to Compaq's Shasta protocol also suggest that while fast user-level messages make finer-grain software DSMs competitive, VM-based systems continue to outperform software-based access control for applications without extensive fine-grain sharing.

Within the family of Cashmere protocols, we find that leveraging intranode hardware coherence provides a 37% performance advantage over a more straightforward one-level implementation. Moreover, contrary to our original expectations, noncoherent hardware support for remote memory writes, total message ordering, and broadcast provides comparatively little additional benefit over fast messaging alone for our application suite.
