Monotonically relaxing concurrent data-structure semantics for performance: An efficient 2D design framework

There has been a significant amount of work in the literature proposing semantic relaxation of concurrent data structures to improve scalability and performance. Relaxing the semantics of a data structure unveils a bigger design space that allows weaker synchronization and more useful parallelism. A major challenge in the area is to devise data structure designs that trade semantics for better performance in a monotonic way. We address this challenge algorithmically in this paper. We present an efficient, lock-free, concurrent data structure design framework for out-of-order semantic relaxation. Our framework introduces a new two-dimensional algorithmic design that uses multiple instances of a given data structure. The first dimension is the number of data structure instances across which operations are spread, in order to benefit from parallelism through disjoint memory accesses. The second dimension is the number of consecutive operations that target the same data structure instance, in order to benefit from data locality. Our design can flexibly explore this two-dimensional space to relax concurrent data structure semantics monotonically, achieving better throughput performance within a tight deterministic relaxation bound, as we prove in the paper. We show how our framework can instantiate lock-free out-of-order queues, stacks, counters and deques. We provide implementations of these relaxed data structures and evaluate their performance and behaviour on two parallel architectures. The experimental evaluation shows that our two-dimensional data structures significantly outperform the respective previously proposed ones with respect to scalability and throughput, and that their throughput increases monotonically as relaxation increases.
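To make the two dimensions concrete, the C++ sketch below applies them to the simplest of the instantiations mentioned above, a relaxed counter. This is a minimal illustration under assumed names and policies: Counter2D, width, depth, and the round-robin hop between sub-counters are ours, and the per-thread operation budget is a simplification of the framework's bounded-relaxation mechanism, not the paper's actual implementation.

// Two-dimensional relaxed counter (illustrative sketch, not the paper's code).
// Dimension 1 (width):  increments are spread over `width` sub-counters,
//                       giving parallelism through disjoint memory accesses.
// Dimension 2 (depth):  a thread performs up to `depth` consecutive increments
//                       on the same sub-counter before hopping, for locality.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class Counter2D {
public:
    Counter2D(std::size_t width, std::uint64_t depth)
        : depth_(depth), subs_(width) {}

    // Relaxed increment: stay on the current sub-counter while the local
    // budget lasts, then move to the next instance (round-robin hop).
    void increment() {
        // Per-thread position and budget; shared across Counter2D objects
        // in this sketch for simplicity.
        thread_local std::size_t idx = 0;
        thread_local std::uint64_t used = 0;
        if (used == depth_) {  // local budget exhausted: hop to the next instance
            ++idx;
            used = 0;
        }
        subs_[idx % subs_.size()].value.fetch_add(1, std::memory_order_relaxed);
        ++used;
    }

    // Approximate read: sums the sub-counters without a global snapshot, so
    // the result may miss in-flight increments (bounded by the relaxation).
    std::uint64_t read() const {
        std::uint64_t sum = 0;
        for (const auto& s : subs_)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }

private:
    // Pad each sub-counter to its own cache line to avoid false sharing.
    struct alignas(64) Sub { std::atomic<std::uint64_t> value{0}; };
    const std::uint64_t depth_;
    std::vector<Sub> subs_;
};

With width = 1 and depth = 1 this degenerates to a single strict atomic counter; growing the first dimension reduces contention and growing the second reduces cache-line traffic, at the cost of a larger, but still bounded, deviation between read() and the exact count.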
