A comparison of software and hardware synchronization mechanisms for distributed shared memory multiprocessors

Efficient synchronization is an essential component of parallel computing. The designers of traditional multiprocessors have included hardware support only for simple operations such as compare-and-swap and load-linked/store-conditional, while high-level synchronization primitives such as locks, barriers, and condition variables have been implemented in software. With the advent of directory-based distributed shared memory (DSM) multiprocessors with significant flexibility in their cache controllers, it is worthwhile considering whether this flexibility should be used to support higher-level synchronization primitives in hardware. In particular, as part of maintaining data consistency, these architectures maintain lists of processors with a copy of a given cache line, which is most of the hardware needed to implement distributed locks. We studied two software and four hardware implementations of locks and found that hardware implementation can reduce lock acquire and release times compared to well-tuned software locks. In terms of macrobenchmark performance, hardware locks reduce application running times both on a synthetic benchmark with heavy lock contention and on a suite of SPLASH benchmarks. In addition, emerging cache coherence protocols promise to increase the time spent synchronizing relative to the time spent accessing shared data, and our study shows that hardware locks can reduce SPLASH execution times further if the time spent accessing shared data is small. Although the overall performance impact of hardware lock mechanisms varies tremendously depending on the application, the added hardware complexity on a flexible architecture like FLASH or Avalanche is negligible, and thus hardware support for high-level synchronization operations should be provided.

This work was supported by the Space and Naval Warfare Systems Command (SPAWAR) and Advanced Research Projects Agency (ARPA), Communication and Memory Architectures for Scalable Parallel Computing, ARPA order B under SPAWAR contract N C.
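
The abstract notes that the sharer lists a directory already keeps are most of what a distributed lock needs. As a rough illustration of that idea, the following minimal sketch models a lock whose state lives in a single directory entry: the current holder plus a FIFO list of waiting nodes. It is a toy software simulation under stated assumptions, not one of the four hardware mechanisms the paper evaluates; the class `DirectoryLock`, its methods, and the integer node IDs are all hypothetical names introduced here.

```python
from collections import deque


class DirectoryLock:
    """Toy model of a lock kept in a home node's directory entry.

    Hypothetical illustration: the names are invented for this sketch and
    do not correspond to any mechanism described in the paper.
    """

    def __init__(self):
        self.holder = None        # node ID that currently owns the lock
        self.waiters = deque()    # FIFO list of node IDs waiting for it

    def acquire(self, node):
        """Request the lock; return True if it is granted immediately."""
        if self.holder is None:
            self.holder = node
            return True
        if node == self.holder or node in self.waiters:
            raise ValueError("node already holds or awaits the lock")
        # Record the requester, much as a directory records a sharer.
        self.waiters.append(node)
        return False

    def release(self, node):
        """Release the lock and hand it to the oldest waiter, if any.

        Returns the new holder's node ID, or None if nobody was waiting.
        """
        if node != self.holder:
            raise ValueError("only the holder may release the lock")
        self.holder = self.waiters.popleft() if self.waiters else None
        return self.holder


lock = DirectoryLock()
lock.acquire(0)    # node 0 is granted the lock at once
lock.acquire(1)    # nodes 1 and 2 are queued at the directory
lock.acquire(2)

order = [0]        # order in which nodes obtain the lock
while lock.holder is not None:
    next_holder = lock.release(lock.holder)
    if next_holder is not None:
        order.append(next_holder)
print(order)       # [0, 1, 2]
```

In this model a release goes straight to the next queued node, so waiters never have to retry an atomic operation on a shared location, which is the kind of traffic a software lock built from compare-and-swap can generate under contention.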