Hierarchical backoff locks for nonuniform communication architectures

This paper identifies node affinity as an important property for scalable general-purpose locks. Nonuniform communication architectures (NUCA), for example CC-NUMA machines built from a few large nodes or from chip multiprocessors (CMPs), have a lower penalty for reading data from a neighbor's cache than from a remote cache. Lock implementations that encourage handing over locks to neighbors reduce the lock handover time and speed up access to the critical data guarded by the lock, but they are also vulnerable to starvation. We propose a set of simple software-based hierarchical backoff locks (HBO) that create node affinity in NUCA. A solution for lowering the risk of starvation is also suggested. The HBO locks are compared with other software-based lock implementations using simple benchmarks, and are shown to be very competitive for uncontested locks while being more than twice as fast for contended locks. An application study also demonstrates superior performance for applications with high lock contention and competitive performance for other programs.
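To make the idea of node affinity through backoff concrete, here is a minimal sketch in C11. It assumes that the lock word records the node id of the current holder, so a failed compare-and-swap tells a contender whether the holder is a neighbor. Neighbors then retry after a short backoff and remote threads after a much longer one, which tends to keep the lock inside one node. All names and backoff constants (`hbo_lock_t`, `LOCAL_BACKOFF_BASE`, `REMOTE_BACKOFF_CAP`, and so on) are hypothetical, the caller is assumed to know its own node id, and the starvation-mitigation scheme mentioned above is not modeled.

```c
#include <stdatomic.h>

/* Hypothetical sentinel meaning "lock is free"; valid node ids are >= 0. */
#define HBO_FREE (-1)

/* Hypothetical tuning constants, in spin-loop iterations. */
#define LOCAL_BACKOFF_BASE   16u
#define LOCAL_BACKOFF_CAP    256u
#define REMOTE_BACKOFF_BASE  256u
#define REMOTE_BACKOFF_CAP   8192u

typedef struct {
    /* Either HBO_FREE or the node id of the thread holding the lock. */
    _Atomic int owner_node;
} hbo_lock_t;

static void hbo_init(hbo_lock_t *l) {
    atomic_init(&l->owner_node, HBO_FREE);
}

/* Busy-wait for roughly n iterations without touching the lock word. */
static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) {
        /* empty: the volatile counter keeps the loop from being optimized away */
    }
}

/* One attempt. Returns 1 on success; on failure stores the observed
 * holder's node id in *holder (a strong CAS never fails spuriously,
 * so the observed value is never HBO_FREE here). */
static int hbo_try_acquire(hbo_lock_t *l, int my_node, int *holder) {
    int expected = HBO_FREE;
    if (atomic_compare_exchange_strong_explicit(&l->owner_node, &expected, my_node,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return 1;
    *holder = expected; /* on failure the CAS wrote the current value here */
    return 0;
}

static void hbo_acquire(hbo_lock_t *l, int my_node) {
    unsigned local_b = LOCAL_BACKOFF_BASE;
    unsigned remote_b = REMOTE_BACKOFF_BASE;
    int holder = HBO_FREE;
    while (!hbo_try_acquire(l, my_node, &holder)) {
        if (holder == my_node) {
            /* Holder is a neighbor: retry soon so the lock tends to stay in this node. */
            spin_delay(local_b);
            if (local_b < LOCAL_BACKOFF_CAP) local_b *= 2;
        } else {
            /* Holder is on another node: back off longer, yielding to that node's threads. */
            spin_delay(remote_b);
            if (remote_b < REMOTE_BACKOFF_CAP) remote_b *= 2;
        }
    }
}

static void hbo_release(hbo_lock_t *l) {
    atomic_store_explicit(&l->owner_node, HBO_FREE, memory_order_release);
}
```

A thread calls `hbo_acquire(&lock, my_node)` and later `hbo_release(&lock)`. In this sketch the CAS that attempts the acquisition also returns the holder's node id, so a contender learns whether the holder is local without any extra memory traffic. The same asymmetry that creates affinity is what makes remote threads prone to starvation under heavy local contention.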