An Efficient Abortable-locking Protocol for Multi-level NUMA Systems

The popularity of Non-Uniform Memory Access (NUMA) architectures has led to numerous locality-preserving hierarchical lock designs, such as HCLH, HMCS, and cohort locks. Locality-preserving locks trade fairness for higher throughput. Hence, some acquisitions can incur long latencies, which may be intolerable for certain applications. Few locks allow a waiting thread to abandon its protocol on a timeout. State-of-the-art abortable locks are not fully locality aware, introduce high overheads, and are unsuitable for frequent aborts. Enhancing locality-aware locks with a lightweight timeout capability is critical for their adoption. In this paper, we design and evaluate the HMCS-T lock, a Hierarchical MCS (HMCS) lock variant that admits a timeout. HMCS-T maintains the locality benefits of HMCS while ensuring that aborts are lightweight. HMCS-T offers the progress guarantee missing in most abortable queuing locks. Our evaluations show that HMCS-T offers the timeout feature at a moderate overhead over its HMCS analog. HMCS-T, used as an MPI runtime lock, mitigated the poor scalability of an MPI+OpenMP BFS code and resulted in 4.3x superior scaling.
