The effect of network total order, broadcast, and remote-write capability on network-based shared memory computing

Emerging system-area networks provide a variety of features that can dramatically reduce network communication overhead. In this paper, we evaluate the impact of such features on the implementation of Software Distributed Shared Memory (SDSM), and on the Cashmere system in particular. Cashmere has been implemented on the Compaq Memory Channel network, which supports low-latency messages, protected remote memory writes, in-expensive broadcast, and total ordering of network packets. Our evaluation is based on several Cashmere protocol variants, ranging from a protocol that fully leverages the Memory Channel's special features to one that uses the network only for fast messaging. We find that the special features improve performance by 18-44% for three of our applications, but less than 12% for our other seven applications. We also find that home node migration, an optimization available only in the message-based protocol, can improve performance by as much as 67%. These results suggest that for systems of modest size, low latency is much more important for SDSM performance than are remote writes, broadcast, or total ordering. At the same time, results on an emulated 32-node system indicate that broadcast based on remote writes of widely-shared data may improve performance by up to 51% for some applications. If hardware broadcast or multicast facilities can be made to scale, they can be beneficial in future system-area networks.

[1]  Milon Mackey,et al.  An implementation of the Hamlyn sender-managed interface architecture , 1996, OSDI '96.

[2]  A A Schäffer,et al.  Parallelization of general-linkage analysis problems. , 1994, Human heredity.

[3]  Kai Li,et al.  UTLB: a mechanism for address translation on network interfaces , 1998, ASPLOS VIII.

[4]  John K. Bennett,et al.  Using multicast and multithreading to reduce communication in software DSM systems , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[5]  Ricardo Bianchini,et al.  Efficiently adapting to sharing patterns in software DSMs , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[6]  Andrea C. Arpaci-Dusseau,et al.  Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.

[7]  Robert J. Fowler,et al.  Adaptive cache coherency for detecting migratory shared data , 1993, ISCA '93.

[8]  Srinivasan Parthasarathy,et al.  Cashmere-2L: software coherent shared memory on a clustered remote-write network , 1997, SOSP.

[9]  Galen C. Hunt,et al.  Vm-based Shared Memory On Low-latency, Remote-memory-access Networks , 1996, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[10]  Mats Brorsson,et al.  An adaptive cache coherence protocol optimized for migratory sharing , 1993, ISCA '93.

[11]  Anoop Gupta,et al.  SPLASH: Stanford parallel applications for shared-memory , 1992, CARN.

[12]  John L. Hennessy,et al.  SoftFLASH: analyzing the performance of clustered distributed virtual shared memory , 1996, ASPLOS VII.

[13]  Alan L. Cox,et al.  Lazy release consistency for software distributed shared memory , 1992, ISCA '92.

[14]  Willy Zwaenepoel,et al.  Implementation and performance of Munin , 1991, SOSP '91.

[15]  Michael L. Scott,et al.  Comparative evaluation of fine- and coarse-grain approaches for software distributed shared memory , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[16]  Kourosh Gharachorloo,et al.  Towards transparent and efficient software distributed shared memory , 1997, SOSP.

[17]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[18]  Ricardo Bianchini,et al.  Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems , 1995, Proceedings of 9th International Parallel Processing Symposium.

[19]  J.P. Singh,et al.  Using network interface support to avoid asynchronous protocol processing in shared virtual memory systems , 1999, Proceedings of the 26th International Symposium on Computer Architecture (Cat. No.99CB36367).

[20]  Alan L. Cox,et al.  TreadMarks: shared memory computing on networks of workstations , 1996 .

[21]  Thorsten von Eicken,et al.  Incorporating Memory Management into User-Level Network Interfaces , 1997 .

[22]  Kourosh Gharachorloo,et al.  Fine-grain software distributed shared memory on SMP clusters , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[23]  Thorsten von Eicken,et al.  ATM and fast Ethernet network interfaces for user-level communication , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[24]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[25]  Liviu Iftode,et al.  Home-based SVM protocols for SMP clusters: Design and performance , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[26]  Thorsten von Eicken,et al.  U-Net: a user-level network interface for parallel and distributed computing , 1995, SOSP.

[27]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[28]  Alan L. Cox,et al.  Software DSM protocols that adapt between single writer and multiple writer , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[29]  T. von Eicken,et al.  Parallel programming in Split-C , 1993, Supercomputing '93.

[30]  Mark D. Hill,et al.  A Unified Formalization of Four Shared-Memory Models , 1993, IEEE Trans. Parallel Distributed Syst..

[31]  Greg J. Regnier,et al.  The Virtual Interface Architecture , 2002, IEEE Micro.

[32]  GharachorlooKourosh,et al.  Towards transparent and efficient software distributed shared memory , 1997 .

[33]  Michel Dubois,et al.  Delayed consistency and its effects on the miss rate of parallel programs , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[34]  Richard B. Gillett Memory Channel Network for PCI , 1996, IEEE Micro.