Design and implementation of a parallel priority queue on many-core architectures

An efficient parallel priority queue is at the core of efforts to parallelize important non-numeric, irregular computations such as discrete event simulation scheduling and branch-and-bound algorithms. GPGPUs can provide a powerful computing platform for such non-numeric computations if an efficient parallel priority queue implementation is available. In this paper, aiming at fine-grained applications, we develop an efficient parallel heap system using CUDA. To our knowledge, this is the first parallel priority queue implementation on many-core architectures, and it thus represents a breakthrough. By using wide heap nodes that enable thousands of simultaneous deletions of the highest-priority items and insertions of new items, and by taking full advantage of CUDA's data-parallel SIMT architecture, we demonstrate up to 30-fold absolute speedup for relatively fine-grained compute loads compared to an optimized sequential priority queue implementation on fast multicores. By comparison, our optimized multicore parallelization of the parallel heap yields only a 2- to 3-fold speedup for such fine-grained loads. This parallelization of a tree-based data structure on GPGPUs provides a roadmap for future parallelizations of other such data structures.
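The wide-node idea can be illustrated with a minimal sequential sketch in Python. It is not the CUDA implementation described in the paper: the class name `WideHeap`, the node capacity `r`, and the side `buffer` that holds partial batches are all assumptions made for illustration. Each heap node stores `r` sorted items, every item in a parent is no larger than any item in its children, and each `delete_min` call removes the `r` smallest items at once. On a GPU, the merge steps would presumably be carried out by many threads in parallel; here they are plain sequential merges.

```python
import heapq


class WideHeap:
    """Sequential sketch of a heap whose nodes each hold r sorted items."""

    def __init__(self, r):
        self.r = r
        self.nodes = []    # every node is a sorted list of exactly r items
        self.buffer = []   # sorted overflow, always fewer than r items

    def insert(self, items):
        # Collect items in the buffer; emit full batches of r into the heap.
        self.buffer = sorted(self.buffer + list(items))
        while len(self.buffer) >= self.r:
            batch, self.buffer = self.buffer[:self.r], self.buffer[self.r:]
            self._insert_node(batch)

    def _insert_node(self, batch):
        # Walk from the root toward the new last node; each node on the
        # path keeps the r smallest items and passes the rest downward.
        path, i = [], len(self.nodes)
        while i > 0:
            i = (i - 1) // 2
            path.append(i)
        for i in reversed(path):
            merged = list(heapq.merge(self.nodes[i], batch))
            self.nodes[i], batch = merged[:self.r], merged[self.r:]
        self.nodes.append(batch)

    def delete_min(self):
        """Remove and return up to r smallest items in sorted order."""
        if not self.nodes:
            out, self.buffer = self.buffer, []
            return out
        # The r smallest items overall lie in the root or in the buffer.
        merged = list(heapq.merge(self.nodes[0], self.buffer))
        out, self.buffer = merged[:self.r], merged[self.r:]
        last = self.nodes.pop()
        if self.nodes:
            self.nodes[0] = last
            self._sift_down(0)
        return out

    def _sift_down(self, i):
        n, r = len(self.nodes), self.r
        while True:
            left, right = 2 * i + 1, 2 * i + 2
            if left >= n:
                return
            small = left
            if right < n:
                # The child with the larger maximum keeps the r largest
                # items of both children, so neither subtree is violated.
                if self.nodes[right][-1] >= self.nodes[left][-1]:
                    small, big = left, right
                else:
                    small, big = right, left
                merged = list(heapq.merge(self.nodes[left], self.nodes[right]))
                self.nodes[small], self.nodes[big] = merged[:r], merged[r:]
            if self.nodes[i][-1] <= self.nodes[small][0]:
                return  # heap property already holds at this node
            merged = list(heapq.merge(self.nodes[i], self.nodes[small]))
            self.nodes[i], self.nodes[small] = merged[:r], merged[r:]
            i = small


h = WideHeap(r=4)
h.insert([9, 3, 7, 1, 8, 2])
h.insert([6, 5, 4, 0, 11, 10, 12])
first = h.delete_min()
print(first)  # → [0, 1, 2, 3]
```

Buffering partial batches keeps every heap node exactly full, which simplifies the merge logic in this sketch; an actual many-core implementation may handle partially filled nodes and overlapping insert and delete phases quite differently.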
