Lease/Release

High memory contention is generally agreed to be a worst-case scenario for concurrent data structures. Considerable research effort has gone into designs that minimize contention, and several programming techniques have been proposed to mitigate its effects. However, there are currently few architectural mechanisms that allow contended data structures to scale at high thread counts. In this article, we investigate hardware support for scalable contended data structures. We propose Lease/Release, a simple addition to standard directory-based MESI cache coherence protocols that allows participants to lease memory, at the granularity of cache lines, by delaying coherence messages for a short, bounded period of time. Our analysis shows that Lease/Release can significantly reduce the overheads of contention for both non-blocking (lock-free) and lock-based data structure implementations, while ensuring that no deadlocks are introduced. We validate Lease/Release empirically on the Graphite multiprocessor simulator on a range of data structures, including queue, stack, and priority queue implementations, as well as on transactional applications. Results show that Lease/Release consistently improves both throughput and energy usage, by up to 5x, for both lock-free and lock-based data structure designs.
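The leasing idea can be illustrated with a toy model of a single cache line. This is a minimal sketch under stated assumptions, not the protocol as specified in the article: the class `LeasedLine`, its methods, and the cap `MAX_LEASE` are hypothetical names chosen for illustration, time is an abstract cycle count, and the directory, MESI states, and message queues are not modeled. The sketch only shows the core behavior described above: a coherence request from another core is delayed until the lease expires or is released, and the delay is bounded.

```python
MAX_LEASE = 100  # hypothetical upper bound on lease length, in cycles


class LeasedLine:
    """Toy model of one cache line that can be leased by a single core."""

    def __init__(self):
        self.holder = None  # core currently holding a lease, if any
        self.expiry = 0     # cycle at which the current lease ends

    def lease(self, core, now, duration):
        """Grant `core` a lease starting at `now`, capped at MAX_LEASE cycles."""
        if self.holder not in (None, core) and now < self.expiry:
            raise RuntimeError("line is leased by another core")
        self.holder = core
        self.expiry = now + min(duration, MAX_LEASE)
        return self.expiry

    def release(self, core):
        """End the lease early; only the current holder may release it."""
        if self.holder == core:
            self.holder = None

    def service_time(self, core, now):
        """Earliest cycle at which a request from `core` arriving at `now`
        can be serviced. Requests from other cores wait for the lease."""
        if self.holder is None or self.holder == core:
            return now
        return max(now, self.expiry)


line = LeasedLine()
line.lease("A", now=0, duration=500)    # request exceeds the cap, so it is bounded
delayed = line.service_time("B", now=10)  # B must wait until A's lease expires
line.release("A")                        # A finishes early and releases
prompt = line.service_time("B", now=50)   # B is now serviced immediately
print(delayed, prompt)
```

Because every lease is capped, a waiting core is delayed by a bounded number of cycles even if the holder never releases, which is the property that keeps a stalled holder from blocking others indefinitely in this simplified model.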
