Efficient stream compaction on wide SIMD many-core architectures

Stream compaction is a common parallel primitive used to remove unwanted elements from sparse data. It allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. We present a novel stream compaction algorithm for wide SIMD many-core architectures and explore several variations of it. The algorithm is designed to maximize concurrent execution with minimal synchronization. It significantly reduces bandwidth and auxiliary storage requirements, which yields substantially better performance. We tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX 280 GPU. On this hardware, our reference implementation provides a 3x speedup over previously published algorithms.
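
To make the primitive concrete, the following is a minimal sequential sketch of the textbook scan-and-scatter formulation of stream compaction, written in C++ for readability. It is not the algorithm presented in this paper, and the function name `compact` and the predicate parameter `keep` are hypothetical names chosen for illustration. On a GPU, each of the three loops would typically become a data-parallel step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): returns the elements of `in`
// for which `keep` is true, preserving their original order.
template <typename T, typename Pred>
std::vector<T> compact(const std::vector<T>& in, Pred keep) {
    const std::size_t n = in.size();

    // Step 1: flag each element (1 = keep, 0 = discard).
    std::vector<std::size_t> flags(n);
    for (std::size_t i = 0; i < n; ++i) {
        flags[i] = keep(in[i]) ? 1 : 0;
    }

    // Step 2: an exclusive prefix sum over the flags gives every kept
    // element its index in the compacted output.
    std::vector<std::size_t> offsets(n);
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[i] = total;
        total += flags[i];
    }
    assert(total <= n);  // the output can never be larger than the input

    // Step 3: scatter the kept elements to their computed positions.
    std::vector<T> out(total);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            out[offsets[i]] = in[i];
        }
    }
    return out;
}
```

For example, calling `compact` on the sequence 3, 0, 0, 7, 0, 5 with a predicate that keeps non-zero values yields 3, 7, 5. The `flags` and `offsets` arrays in this sketch are the kind of auxiliary storage, and the repeated passes over the data the kind of bandwidth cost, that the abstract says the presented algorithm reduces.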
