An adaptive concurrent priority queue for NUMA architectures

Designing scalable concurrent priority queues for contemporary NUMA servers is challenging. Several NUMA-unaware implementations scale to a high number of threads by exploiting the parallelism available in insert operations. In deleteMin-dominated workloads, however, threads compete for access to the same memory locations, i.e., the first item in the priority queue. In such cases, NUMA-aware implementations are typically preferred, since they reduce coherence traffic between the nodes of a NUMA system. In this work, we propose an adaptive priority queue, called SmartPQ, that tunes itself by automatically switching between NUMA-unaware and NUMA-aware algorithmic modes to provide the highest available performance under all workloads. SmartPQ is built on top of NUMA Node Delegation (Nuddle), a low-overhead technique for constructing NUMA-aware data structures using any NUMA-unaware implementation as its backbone. Moreover, SmartPQ employs machine learning to decide when to switch between its two algorithmic modes. As our evaluation reveals, it achieves the highest available performance with an 88% success rate and dynamically adapts between a NUMA-aware and a NUMA-unaware mode, without overheads, while performing up to 1.83 times better than SprayList, the state-of-the-art NUMA-unaware priority queue.
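The mode-switching idea can be illustrated with a small, single-threaded toy. This is only a sketch under stated assumptions, not the SmartPQ implementation: the class name `AdaptivePQSketch`, the sliding window, and the fixed `threshold` are all hypothetical, and the hand-written threshold on the recent deleteMin ratio merely stands in for the learned model the abstract mentions. Both modes operate on the same heap here, whereas a real NUMA-aware mode would route operations through delegation.

```python
import heapq
from collections import deque


class AdaptivePQSketch:
    """Toy illustration of adaptive mode selection (hypothetical, not SmartPQ)."""

    def __init__(self, window=100, threshold=0.5):
        self._heap = []
        # Sliding window of recent operations: True = deleteMin, False = insert.
        self._recent = deque(maxlen=window)
        # Assumption: a fixed threshold replaces the learned classifier.
        self._threshold = threshold
        self.mode = "numa_unaware"

    def _record(self, is_delete_min):
        # Update the window, then re-evaluate which mode the workload favors.
        self._recent.append(is_delete_min)
        delete_ratio = sum(self._recent) / len(self._recent)
        if delete_ratio > self._threshold:
            self.mode = "numa_aware"    # deleteMin-dominated: contention on the head
        else:
            self.mode = "numa_unaware"  # insert-dominated: parallelism is available

    def insert(self, key):
        self._record(False)
        heapq.heappush(self._heap, key)

    def delete_min(self):
        self._record(True)
        # In this toy both modes share one heap; only the mode label changes.
        return heapq.heappop(self._heap) if self._heap else None
```

With `window=4`, inserting four keys leaves the sketch in `numa_unaware` mode, and a run of `delete_min` calls flips it to `numa_aware` once deletions exceed half of the window.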
