Using network interface support to avoid asynchronous protocol processing in shared virtual memory systems

The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity.This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability.We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well performing applications, even on our relatively slow programmable NI, and more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with and the implications for the communication layer.

[1]  R. Gillett,et al.  Overview of memory channel network for PCI , 1996, COMPCON '96. Technologies for the Information Superhighway Digest of Papers.

[2]  Alan L. Cox,et al.  TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.

[3]  Piet Hut,et al.  A hierarchical O(N log N) force-calculation algorithm , 1986, Nature.

[4]  Marc Levoy,et al.  Parallel visualization algorithms: performance and architectural implications , 1994, Computer.

[5]  A. Agarwal,et al.  MGS: A Multigrain Shared Memory System , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[6]  Liviu Iftode,et al.  Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems , 1996, OSDI '96.

[7]  Guy E. Blelloch,et al.  A comparison of sorting algorithms for the connection machine CM-2 , 1991, SPAA '91.

[8]  James R. Larus,et al.  Implementing Fine-grain Distributed Shared Memory on Commodity SMP Workstations , 1996 .

[9]  Margaret Martonosi,et al.  Performance monitoring in a Myrinet-connected SHRIMP cluster , 1998, SPDT '98.

[10]  Michael L. Scott,et al.  Using memory-mapped network interfaces to improve the performance of distributed shared memory , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[11]  Kai Li,et al.  Understanding Application Performance on Shared Virtual Memory Systems , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[12]  Scott Pakin,et al.  High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[13]  GuptaAnoop,et al.  Parallel Visualization Algorithms , 1994 .

[14]  John L. Hennessy,et al.  SoftFLASH: analyzing the performance of clustered distributed virtual shared memory , 1996, ASPLOS VII.

[15]  Evangelos P. Markatos,et al.  Telegraphos: A Substrate for High-Performance Computing on Workstation Clusters , 1997, J. Parallel Distributed Comput..

[16]  David H. Bailey,et al.  FFTs in external or hierarchical memory , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[17]  Cezary Dubnicki,et al.  VMMC-2 : Efficient Support for Reliable, Connection-Oriented Communication , 1997 .

[18]  Liviu Iftode,et al.  Home-based SVM protocols for SMP clusters: Design and performance , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[19]  Charles L. Seitz,et al.  Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.

[20]  Ricardo Bianchini,et al.  Hiding communication latency and coherence overhead in software DSMs , 1996, ASPLOS VII.

[21]  John L. Hennessy,et al.  The performance advantages of integrating block data transfer in cache-coherent multiprocessors , 1994, ASPLOS VI.

[22]  John L. Hennessy,et al.  The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors , 1993 .

[23]  Greg J. Regnier,et al.  The Virtual Interface Architecture , 2002, IEEE Micro.

[24]  Sotiris Ioannidis,et al.  Efficient Use of Memory Mapped Interfaces for Shared Memory Computing , 1997 .

[25]  Willy Zwaenepoel,et al.  Munin: distributed shared memory based on type-specific memory coherence , 1990, PPOPP '90.

[26]  Yuanyuan Zhou,et al.  Limits to the performance of software shared memory: a layered approach , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[27]  Kourosh Gharachorloo,et al.  Fine-grain software distributed shared memory on SMP clusters , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[28]  Bernard Tourancheau,et al.  BIP: A New Protocol Designed for High Performance Networking on Myrinet , 1998, IPPS/SPDP Workshops.

[29]  Liviu Iftode,et al.  Relaxed consistency and coherence granularity in DSM systems: a performance evaluation , 1997, PPOPP '97.

[30]  Jaswinder Pal Singh,et al.  Scaling application performance on a cache-coherent multiprocessor , 1999, ISCA.

[31]  L. Hernquist Hierarchical N-body methods , 1987 .

[32]  J.P. Singh,et al.  Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[33]  Jaswinder Pal Singh,et al.  Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors , 1997, PPOPP '97.

[34]  John L. Hennessy,et al.  Finding and Exploiting Parallelism in an Ocean Simulation Program: Experience, Results, and Implications , 1992, J. Parallel Distributed Comput..

[35]  Liviu Iftode,et al.  Improving release-consistent shared virtual memory using automatic update , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[36]  Seth Copen Goldstein,et al.  Active Messages: A Mechanism for Integrated Communication and Computation , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[37]  Paul Hudak,et al.  Memory coherence in shared virtual memory systems , 1986, PODC '86.

[38]  Babak Falsafi,et al.  Scheduling communication on an SMP node parallel machine , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[39]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[40]  Marc Levoy,et al.  Volume rendering on scalable shared-memory MIMD architectures , 1992, VVS.

[41]  Liviu Iftode,et al.  Supporting a Coherent Shared Address Space Across SMP Nodes: An Application-Driven Investigation , 1999 .

[42]  Srinivasan Parthasarathy,et al.  Cashmere-2L: software coherent shared memory on a clustered remote-write network , 1997, SOSP.

[43]  Liviu Iftode,et al.  Evaluation of hardware write propagation support for next-generation shared virtual memory clusters , 1998, ICS '98.

[44]  Galen C. Hunt,et al.  Vm-based Shared Memory On Low-latency, Remote-memory-access Networks , 1996, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[45]  Angelos Bilas,et al.  The Effects of Communication Parameters on End Performance of Shared Virtual Memory Clusters , 1997, ACM/IEEE SC 1997 Conference (SC'97).

[46]  Angelos Bilas,et al.  User-Space Communication: A Quantitative Study , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[47]  D. Brandt,et al.  Multi-level adaptive solutions to boundary-value problems math comptr , 1977 .

[48]  Liviu Iftode,et al.  Understanding Application Performance on Shared Virtual Memory Systems , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[49]  Per Stenström,et al.  Performance evaluation of a cluster-based multiprocessor built from ATM switches and bus-based multiprocessor servers , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[50]  Hiroshi Tezuka PM : A High-Performance Communication Library for Multi-user Parallel Environments , 1996 .

[51]  Thorsten von Eicken,et al.  U-Net: a user-level network interface for parallel and distributed computing , 1995, SOSP.