Fast synchronization on shared-memory multiprocessors: An architectural approach

Synchronization is a crucial operation in many parallel applications. Conventional synchronization mechanisms are failing to keep up with the increasing demand for efficient synchronization operations as systems grow larger and network latency increases. The contributions of this paper are threefold. First, we revisit some representative synchronization algorithms in light of recent architecture innovations and provide an example of how the simplifying assumptions made by typical analytical models of synchronization mechanisms can lead to significant performance estimate errors. Second, we present an architectural innovation called active memory that enables very fast atomic operations in a shared-memory multiprocessor. Third, we use execution-driven simulation to quantitatively compare the performance of a variety of synchronization mechanisms based on both existing hardware techniques and active memory operations. To the best of our knowledge, synchronization based on active memory outforms all existing spinlock and non-hardwired barrier implementations by a large margin.

[1]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[2]  Ralph Grishman,et al.  The NYU ultracomputer—designing a MIMD, shared-memory parallel machine , 2018, ISCA '98.

[3]  V. Clower Latency , 1979 .

[4]  Dimitrios S. Nikolopoulos,et al.  The Architectural and Operating System Implications on the Performance of Synchronization on ccNUMA Multiprocessors , 2001, International Journal of Parallel Programming.

[5]  Keshav Pingali,et al.  I-structures: data structures for parallel computing , 1986, Graph Reduction.

[6]  Gerry Kane,et al.  MIPS RISC Architecture , 1987 .

[7]  Mary K. Vernon,et al.  Efficient synchronization primitives for large-scale cache-coherent multiprocessors , 1989, ASPLOS III.

[8]  Larry Rudolph,et al.  Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors , 1983, TOPL.

[9]  Cathy May,et al.  The PowerPC Architecture: A Specification for a New Family of RISC Processors , 1994 .

[10]  Evangelos P. Markatos,et al.  The effects of multiprogramming on barrier synchronization , 1991, Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing.

[11]  A. Gottlieb,et al.  Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches , 2004 .

[12]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[13]  Seth Copen Goldstein,et al.  Active messages: a mechanism for integrating communication and computation , 1998, ISCA '98.

[14]  Steven L. Scott,et al.  Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.

[15]  Burton J. Smith,et al.  The Horizon supercomputing system: architecture and software , 1988, Proceedings. SUPERCOMPUTING '88.

[16]  Maged M. Michael,et al.  Coherence controller architectures for SMP-based CC-NUMA multiprocessors , 1997, ISCA '97.

[17]  D. Burger,et al.  Efficient Synchronization: Let Them Eat QOLB /sup1/ , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[18]  David E. Culler,et al.  Monsoon: an explicit token-store architecture , 1998, ISCA '98.

[19]  Fabrizio Petrini,et al.  Scalable collective communication on the ASCI Q machine , 2003, 11th Symposium on High Performance Interconnects, 2003. Proceedings..

[20]  Seth Copen Goldstein,et al.  Active messages: a mechanism for integrating communication and computation , 1998, ISCA '98.

[21]  Rajiv Gupta,et al.  A scalable implementation of barrier synchronization using an adaptive combining tree , 1990, International Journal of Parallel Programming.

[22]  Òòòðð,et al.  Shared-memory Mutual Exclusion: Major Research Trends Since 1986 , 1986 .

[23]  Michael L. Scott,et al.  Fast, contention-free combining tree barriers for shared-memory multiprocessors , 1994, International Journal of Parallel Programming.

[24]  John L. Hennessy,et al.  Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation , 2003, IEEE Trans. Computers.

[25]  Mary K. Vernon,et al.  Efficient synchronization primitives for large-scale cache-coherent multiprocessors , 1989, ASPLOS 1989.

[26]  Dhabaleswar K. Panda,et al.  MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems , 2001, IEEE Trans. Parallel Distributed Syst..

[27]  Shisheng Shang,et al.  Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters , 1995, IEEE Trans. Parallel Distributed Syst..

[28]  Thomas E. Anderson,et al.  The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors , 1990, IEEE Trans. Parallel Distributed Syst..

[29]  Nian-Feng Tzeng,et al.  Distributing Hot-Spot Addressing in Large-Scale Multiprocessors , 1987, IEEE Transactions on Computers.

[30]  Maged M. Michael,et al.  Coherence Controller Architectures For Smp-based Cc-numa Multiprocessors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[31]  Allan Porterfield,et al.  The Tera computer system , 1990, ICS '90.

[32]  Dhabaleswar K. Panda,et al.  Efficient barrier using remote memory operations on VIA-based clusters , 2002, Proceedings. IEEE International Conference on Cluster Computing.

[33]  James R. Goodman,et al.  Efficient Synchronization: Let Them Eat QOLB , 1997, International Symposium on Computer Architecture.