Queue Delegation Locking

The scalability of parallel programs is often bounded by the performance of the synchronization mechanisms used to protect critical sections. The performance of these mechanisms is in turn determined by their sequential execution time, their efficient use of hardware, and their ability to avoid waiting. In this article, we describe queue delegation (QD) locking, a family of locks that both delegate critical sections and enable detached execution. Threads delegate work to the thread currently holding the lock and are able to detach, i.e., immediately continue their execution until they need a result from a previously delegated critical section. We show how to use queue delegation to build synchronization algorithms with lower overhead and higher throughput than existing algorithms, even when critical sections need to communicate results back immediately. In experiments with up to 64 threads accessing a shared priority queue, QD locking provides 10 times higher throughput than Pthreads mutex locks and outperforms leading delegation algorithms. When parallel reads are mixed with delegated write operations, QD locking outperforms competing algorithms by 9.5 to 207 percent in throughput. Last but not least, continuing execution instead of waiting for critical sections to execute leads to increased parallelism and better scalability. As we will see, queue delegation locking uses simple building blocks whose overhead is low even in uncontended use. Together, these properties make the technique useful in a wide variety of applications.
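
The delegate-and-detach idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `QDLockSketch`, the method `delegate`, and the helper `run_demo` are hypothetical, and the delegation queue is assumed here to be a plain list protected by a small guard lock, chosen for readability rather than performance. A thread that wins the try-lock runs its own operation and then drains operations queued by others; a thread that loses enqueues its operation and continues, reading the result from a future only when it needs it.

```python
import threading
import time
from concurrent.futures import Future


class QDLockSketch:
    """Simplified delegation lock: the holder runs operations queued by others."""

    def __init__(self):
        self._mutex = threading.Lock()  # held by the current helper thread
        self._guard = threading.Lock()  # protects _queue and _open (a simplification)
        self._queue = []
        self._open = False              # True only while a helper accepts work

    def delegate(self, fn, *args):
        """Submit fn(*args) as a critical section and return a Future for its result."""
        fut = Future()
        while True:
            if self._mutex.acquire(blocking=False):
                # We hold the lock: open the queue, run our own work, then help others.
                try:
                    with self._guard:
                        self._open = True
                    self._run(fut, fn, args)
                    self._drain()
                finally:
                    self._mutex.release()
                return fut
            with self._guard:
                if self._open:
                    # Delegate to the current holder and detach immediately.
                    self._queue.append((fut, fn, args))
                    return fut
            time.sleep(0)  # queue is closed: yield and retry the try-lock

    def _drain(self):
        # Run queued operations until the queue is empty, then close it.
        # Closing happens under the guard, so no accepted operation is lost.
        while True:
            with self._guard:
                if not self._queue:
                    self._open = False
                    return
                batch, self._queue = self._queue, []
            for fut, fn, args in batch:
                self._run(fut, fn, args)

    @staticmethod
    def _run(fut, fn, args):
        try:
            fut.set_result(fn(*args))
        except Exception as exc:
            fut.set_exception(exc)


def run_demo(num_threads=8, ops_per_thread=1000):
    """Increment a shared counter through delegated critical sections."""
    lock = QDLockSketch()
    state = {"count": 0}

    def increment():
        state["count"] += 1
        return state["count"]

    def worker():
        # Detach: issue all operations first, wait for results only afterwards.
        pending = [lock.delegate(increment) for _ in range(ops_per_thread)]
        for fut in pending:
            fut.result()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the helper keeps draining for as long as work arrives, so a busy lock can hold one thread in the helper role indefinitely; a practical design would bound how much work a single helper accepts before handing the lock over.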
