Application scaling under shared virtual memory on a cluster of SMPs

In this paper we examine how application performance scales on a state-of-the-art shared virtual memory (SVM) system running on a 64-processor cluster of 4-way SMPs connected by a fast system area network. The protocol we use is home-based and takes advantage of the general-purpose data movement and mutual exclusion support provided by a programmable network interface. We find that SVM requires considerably more application restructuring than is needed for good performance on a hardware-coherent system of this scale, and that it needs larger problem sizes to perform well. Even so, SVM performs surprisingly well at the 64-processor scale for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end hardware-coherent system and often much more. We explore application restructurings beyond those developed earlier for smaller-scale SVM systems, examine the main remaining system and application bottlenecks, and point out directions for future research.