Lock Cohorting

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machine's nonuniform memory and caching hierarchy, ever more important. This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability. We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, on memcached, a real-world key-value store application, and on the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low and significantly outperform them as the load increases.
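The abstract does not spell out the construction, so the following is only a minimal sketch of one way a cohort lock can be assembled from two levels of ticket locks. It assumes the usual two-level arrangement: one global lock shared by all NUMA nodes, plus one local lock per node. The global lock must tolerate being released by a different thread than the one that acquired it, and the local lock must be able to report whether other threads are waiting. The names `TicketLock`, `CohortLock`, `has_waiters`, `kMaxHandoffs`, and `run_demo` are hypothetical, the handoff bound of 64 is an arbitrary illustrative value, and callers pass their node index explicitly instead of querying the operating system.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Plain ticket lock. unlock() only advances `serving_`, so any thread may
// release it; has_waiters() lets the holder detect queued threads.
class TicketLock {
    std::atomic<std::uint32_t> next_{0};
    std::atomic<std::uint32_t> serving_{0};
public:
    void lock() {
        const std::uint32_t ticket = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != ticket)
            std::this_thread::yield();
    }
    void unlock() { serving_.fetch_add(1, std::memory_order_release); }
    // Meaningful only when called by the current holder.
    bool has_waiters() const {
        return next_.load(std::memory_order_relaxed) -
               serving_.load(std::memory_order_relaxed) > 1;
    }
};

// Sketch of a cohort lock: a global lock plus one local lock per node.
class CohortLock {
    struct Node {
        TicketLock local;
        bool global_held = false;  // protected by `local`
        int handoffs = 0;          // protected by `local`
    };
    static constexpr int kMaxHandoffs = 64;  // assumed bound, for fairness
    TicketLock global_;
    std::vector<Node> nodes_;
public:
    explicit CohortLock(std::size_t node_count) : nodes_(node_count) {}

    void lock(std::size_t node) {
        Node& n = nodes_[node];
        n.local.lock();
        if (!n.global_held) {      // no handoff: compete for the global lock
            global_.lock();
            n.global_held = true;
            n.handoffs = 0;
        }
    }

    void unlock(std::size_t node) {
        Node& n = nodes_[node];
        assert(n.global_held);
        if (n.local.has_waiters() && n.handoffs < kMaxHandoffs) {
            ++n.handoffs;          // keep the global lock within this node
        } else {
            n.global_held = false; // let other nodes in
            global_.unlock();
        }
        n.local.unlock();
    }
};

// Hypothetical demo: thread i pretends to run on node (i % nodes) and
// increments a plain counter that is protected only by the cohort lock.
long run_demo(std::size_t nodes, std::size_t threads, long iters) {
    CohortLock lock(nodes);
    long counter = 0;
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < threads; ++i)
        pool.emplace_back([&, i] {
            for (long k = 0; k < iters; ++k) {
                lock.lock(i % nodes);
                ++counter;
                lock.unlock(i % nodes);
            }
        });
    for (auto& t : pool) t.join();
    return counter;
}
```

In this sketch the lock tends to stay on one node while that node has waiters, which is what would keep the lock word and the protected data in that node's caches. The handoff bound is the knob that trades this locality against fairness to threads on other nodes.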
