A Performance Evaluation of Dynamic Parallelism for Fine-Grained, Irregular Workloads

GPU compute devices have become very popular for general-purpose computations. However, the SIMD-like hardware of graphics processors is currently not well suited to irregular workloads, such as searching unbalanced trees. To mitigate this drawback, NVIDIA introduced an extension to its GPU programming model called Dynamic Parallelism. This extension enables GPU programs to spawn new units of work directly on the GPU, allowing subsequent work items to be refined based on intermediate results without any involvement of the host CPU. This work investigates methods for employing Dynamic Parallelism with the goal of improving workload distribution for tree search algorithms on modern GPU hardware. For the evaluation of the proposed approaches, a case study is conducted on the N-Queens problem. Extensive benchmarks indicate that the benefits of improved resource utilization fail to outweigh the high management overhead and runtime limitations caused by the very fine granularity of the investigated problem. However, novel memory management concepts for passing parameters to child grids are presented. These general concepts are applicable to other, more coarse-grained problems that benefit from the use of Dynamic Parallelism.
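To make the device-side launch mechanism concrete, the following is a minimal CUDA Dynamic Parallelism sketch; the kernel names, launch configuration, and the integer parameter passed to the child grid are illustrative and not taken from the evaluated implementation.

    #include <cstdio>

    // Child grid: handles one work item produced by the parent grid.
    __global__ void childKernel(int item) {
        printf("child grid processing item %d (thread %d)\n", item, threadIdx.x);
    }

    // Parent grid: each thread may spawn a child grid to refine its
    // intermediate result, without returning control to the host CPU.
    __global__ void parentKernel(int numItems) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numItems) {
            // Device-side launch; requires compute capability 3.5+ and
            // compilation with relocatable device code (nvcc -rdc=true).
            childKernel<<<1, 32>>>(i);
        }
    }

    int main() {
        parentKernel<<<1, 128>>>(8);   // host-side launch of the parent grid
        cudaDeviceSynchronize();       // wait for parent and child grids to finish
        return 0;
    }

Scalar arguments can simply be passed by value, as shown; the memory management concepts mentioned above concern the more general problem of handing parameters from a parent grid to its child grids.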
