Lease/release: architectural support for scaling contended data structures

High memory contention is generally agreed to be a worst-case scenario for concurrent data structures. Significant research effort has gone into designs that minimize contention, and several programming techniques have been proposed to mitigate its effects. However, there are currently few architectural mechanisms that allow contended data structures to scale at high thread counts. In this paper, we investigate hardware support for scalable contended data structures. We propose Lease/Release, a simple addition to standard directory-based MSI cache coherence protocols that allows participants to lease memory, at the granularity of cache lines, by delaying coherence messages for a short, bounded period of time. Our analysis shows that Lease/Release can significantly reduce the overheads of contention for both non-blocking (lock-free) and lock-based data structure implementations, while ensuring that no deadlocks are introduced. We validate Lease/Release empirically on the Graphite multiprocessor simulator, on a range of data structures, including queue, stack, and priority queue implementations, as well as on transactional applications. Results show that Lease/Release consistently improves both throughput and energy usage, by up to 5x, for both lock-free and lock-based data structure designs.
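
The leasing idea described above can be illustrated with a small software model. The sketch below is only a toy, discrete-time illustration, not the hardware protocol from the paper: the class `LeasedLine`, its methods, and the bound `MAX_LEASE` are hypothetical names chosen for this example. It assumes that requests are issued in time order and shows a single cache line whose owner may hold a bounded lease, so that a competing request arriving during the lease is delayed until the lease expires or is released.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LEASE = 100  # hypothetical upper bound on a lease, in simulated cycles


@dataclass
class LeasedLine:
    """Toy model of one cache line whose owner may hold a bounded lease."""
    owner: Optional[int] = None
    lease_expiry: int = 0  # cycle at which the current lease ends

    def acquire(self, core: int, now: int) -> int:
        """Return the cycle at which `core` is granted ownership.

        A request from a different core that arrives while a lease is
        active is delayed until the lease ends.
        """
        if self.owner not in (None, core):
            now = max(now, self.lease_expiry)
        self.owner = core
        self.lease_expiry = now  # plain ownership carries no lease
        return now

    def lease(self, core: int, now: int, duration: int) -> int:
        """Acquire the line and lease it; return the lease expiry cycle."""
        grant = self.acquire(core, now)
        # The lease is bounded: long requests are capped at MAX_LEASE.
        self.lease_expiry = grant + min(duration, MAX_LEASE)
        return self.lease_expiry

    def release(self, core: int, now: int) -> None:
        """The owner ends its lease early (voluntary release)."""
        if self.owner == core and now < self.lease_expiry:
            self.lease_expiry = now


# Core 0 asks for a very long lease; the model caps it at MAX_LEASE.
line = LeasedLine()
expiry = line.lease(core=0, now=0, duration=1000)
delayed_grant = line.acquire(core=1, now=10)  # arrives during the lease

# Core 0 releases early, so a later request is not delayed.
early = LeasedLine()
early.lease(core=0, now=0, duration=50)
early.release(core=0, now=20)
prompt_grant = early.acquire(core=1, now=25)

print(expiry, delayed_grant, prompt_grant)
```

In this model the delay seen by a competing core never exceeds `MAX_LEASE` cycles, which mirrors the "short, bounded period of time" in the abstract; a real implementation would enforce the bound in the coherence hardware rather than in software.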
