Algorithmic strategies for optimizing the parallel reduction primitive in CUDA

Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. These primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. Thus, the study of novel algorithmic strategies to optimize their implementation on current devices is of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although tree-based and sequential-based algorithms have already been implemented on GPUs, the performance of the two approaches had not yet been compared. Thus, our first contribution is an experimental study of state-of-the-art reduction algorithms on CUDA. Next, we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), further improving its throughput. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case, it is not clear which algorithm performs best, since throughput depends heavily on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential-based methods are better for medium-sized and large ones.
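
To make the distinction between the two algorithm families concrete, the following CUDA sketch contrasts a tree-based kernel with a sequential-based one for the common case of addition. This is a minimal illustration under assumed names and launch parameters (reduce_tree, reduce_sequential, 256 threads per block, a two-pass host driver), not the tuned kernels evaluated in the paper.

#include <cstdio>
#include <vector>

// Tree-based reduction: threads pairwise-combine values in shared memory,
// halving the number of active threads at each step.
__global__ void reduce_tree(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;      // zero-pad out-of-range lanes
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0]; // one partial result per block
}

// Sequential-based reduction: each thread first accumulates many elements
// serially via a grid-stride loop, then a short tree step combines the
// per-thread partials, trading synchronization steps for serial work.
__global__ void reduce_sequential(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    float acc = 0.0f;
    for (unsigned i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];                          // serial accumulation phase
    sdata[tid] = acc;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = 128;
    std::vector<float> h(n, 1.0f);             // all ones: the sum equals n
    float *d_in, *d_part;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_part, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Pass 1: the sequential-based kernel produces one partial sum per block.
    reduce_sequential<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_part, n);
    // Pass 2: a single tree-based block folds the partials into one value.
    reduce_tree<<<1, threads, threads * sizeof(float)>>>(d_part, d_part, blocks);
    float result;
    cudaMemcpy(&result, d_part, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);
    cudaFree(d_in);
    cudaFree(d_part);
    return 0;
}

The sequential-based variant issues far fewer __syncthreads() calls per input element, which is one reason such kernels tend to win for long, unsegmented inputs.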