Parallel Combining: Making Use of Free Cycles

There are two intertwined factors that affect the performance of concurrent data structures: the ability of processes to access the shared data in parallel, and the cost of synchronization. It has been observed that for a class of "concurrency-averse" data structures, the use of fine-grained locking for parallelization does not pay off: an implementation based on a single global lock outperforms fine-grained solutions. The combining paradigm exploits this by ensuring that a thread holding the global lock combines requests and then executes the combined requests sequentially on behalf of other (waiting) concurrent threads. The downside is that the waiting threads sit idle even when their concurrently applied requests could potentially be performed in parallel. In this paper, we propose parallel combining, a technique that leverages the computational power of waiting threads. The idea is that the combiner thread assigns waiting threads to perform requests synchronously using a parallel algorithm. We discuss two applications of the technique. First, we use it to transform a sequential data structure into a concurrent one optimized for read-dominated workloads. Second, we use it to construct a concurrent data structure from a batched one that allows synchronous invocations of sets of operations. In both cases, we obtain significant performance gains with respect to the state-of-the-art algorithms.
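The first application above (a global-lock structure where the combiner delegates pending read requests back to their waiting threads, so reads run in parallel while writers are locked out) can be illustrated with a minimal Python sketch. All names here (`ParallelCombiningSet`, `insert`, `contains`, the request record) are illustrative assumptions, not the paper's API; the paper's actual construction uses carefully engineered combining structures and parallel algorithms.

```python
# Hypothetical sketch of parallel combining for a read-dominated set.
# Writers take the single global lock; a reader that becomes the combiner
# drains the pending-request list and, instead of executing every lookup
# itself (flat combining), wakes each waiting thread to run its OWN lookup
# in parallel while the combiner holds the global lock.
import threading


class _Request:
    def __init__(self, arg):
        self.arg = arg
        self.start = threading.Event()   # combiner -> waiter: "run your read"
        self.done = threading.Event()    # waiter -> combiner: "read finished"
        self.result = None


class ParallelCombiningSet:
    def __init__(self):
        self._lock = threading.Lock()          # single global lock
        self._items = set()
        self._pending = []                     # published read requests
        self._pending_lock = threading.Lock()

    def insert(self, x):
        # Writers go through the global lock, excluding all combined reads.
        with self._lock:
            self._items.add(x)

    def contains(self, x):
        req = _Request(x)
        with self._pending_lock:
            self._pending.append(req)          # publish the request
        while not req.done.is_set():
            if req.start.wait(timeout=0.001):
                # A combiner delegated this request back to us: perform our
                # own read in parallel with the other delegated readers.
                req.result = req.arg in self._items
                req.done.set()
                break
            if self._lock.acquire(blocking=False):  # try to become combiner
                try:
                    with self._pending_lock:
                        batch, self._pending = self._pending, []
                    # Parallel phase: execute our own request, delegate the
                    # rest to their owner threads, then wait for all of them
                    # before releasing the lock (writers stay excluded).
                    for r in batch:
                        if r is req:
                            r.result = r.arg in self._items
                            r.done.set()
                        else:
                            r.start.set()
                    for r in batch:
                        r.done.wait()
                finally:
                    self._lock.release()
        return req.result
```

In this sketch the "parallel algorithm" is trivially each thread performing its own independent lookup; the technique becomes interesting when the combined batch admits a genuinely cooperative parallel algorithm, as in the batched-data-structure application.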
