Efficient Parallel Scan Algorithms for GPUs

Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
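
To make the divide-and-conquer idea concrete, the following is a minimal sketch of a block-level inclusive sum scan built on top of an intra-warp scan. It is an illustration under stated assumptions, not the paper's or CUDPP's code: the names `warp_inclusive_scan`, `block_inclusive_scan`, and `scan_on_gpu` are hypothetical, the intra-warp step uses the `__shfl_up_sync` shuffle intrinsic (a choice made here for brevity that may differ from the paper's implementation), the input is assumed to fit in a single block of at most 1024 threads, and CUDA error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int WARP_SIZE = 32;
constexpr unsigned FULL_MASK = 0xffffffffu;

// Intra-warp primitive (hypothetical name): inclusive sum scan over the
// 32 lanes of one warp. Assumes all 32 lanes of the warp call it together.
__device__ int warp_inclusive_scan(int value) {
    const int lane = threadIdx.x & (WARP_SIZE - 1);
    for (int offset = 1; offset < WARP_SIZE; offset <<= 1) {
        // Read the partial sum held by the lane `offset` positions below.
        int neighbor = __shfl_up_sync(FULL_MASK, value, offset);
        if (lane >= offset) value += neighbor;
    }
    return value;
}

// Block-level scan built from the warp primitive (hypothetical name).
// Assumes a single block whose size is a multiple of 32 and at most 1024.
__global__ void block_inclusive_scan(const int* in, int* out, int n) {
    __shared__ int warp_totals[WARP_SIZE];

    const int tid = threadIdx.x;
    const int lane = tid & (WARP_SIZE - 1);
    const int warp = tid / WARP_SIZE;
    const int num_warps = blockDim.x / WARP_SIZE;

    // Step 1: every warp scans its own 32 elements independently.
    int value = (tid < n) ? in[tid] : 0;
    value = warp_inclusive_scan(value);

    // Step 2: the last lane of each warp publishes that warp's total.
    if (lane == WARP_SIZE - 1) warp_totals[warp] = value;
    __syncthreads();

    // Step 3: the first warp scans the per-warp totals.
    if (warp == 0) {
        int total = (lane < num_warps) ? warp_totals[lane] : 0;
        warp_totals[lane] = warp_inclusive_scan(total);
    }
    __syncthreads();

    // Step 4: each warp adds the combined total of all preceding warps.
    if (warp > 0) value += warp_totals[warp - 1];
    if (tid < n) out[tid] = value;
}

// Host-side helper (hypothetical): scans up to 1024 integers on the GPU.
std::vector<int> scan_on_gpu(const std::vector<int>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<int> output(n);
    if (n == 0) return output;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Round the block size up to a whole number of warps.
    const int threads = ((n + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
    block_inclusive_scan<<<1, threads>>>(d_in, d_out, n);

    cudaMemcpy(output.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return output;
}
```

Calling `scan_on_gpu` on a vector of 100 ones should yield the running counts 1 through 100. The same four-step pattern can be applied one level up, with block totals taking the place of warp totals, to scan arrays larger than a single block.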
