Stream compaction, also known as stream filtering or selection, produces a smaller output array that contains the indices of only the wanted elements of the input array for further processing. Because the number of elements to be filtered can be very large, the performance of selection is of great concern. Modern Graphics Processing Units (GPUs) are increasingly used to accelerate large, data-parallel applications. In this paper, we design and implement two new algorithms for stream compaction on the GPU. The first algorithm, which preserves the relative order of the input elements, uses a multi-level prefix-sum approach. The second algorithm, which is non-order-preserving, is based on a hybrid use of the prefix-sum and atomics approaches. We compare their performance with that of other parallel selection algorithms on the current generation of NVIDIA GPUs. The experimental results show that both algorithms run faster than Thrust, an open-source parallel algorithms library. Furthermore, the hybrid method performs best among all existing selection algorithms on the GPU and can be two orders of magnitude faster than sequential selection on the CPU, especially when the data size is large.
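To make the prefix-sum idea concrete, the following is a minimal sequential C++ sketch of order-preserving compaction. It is not the multi-level GPU algorithm of this paper. The function name `compact_indices` and the predicate parameter are hypothetical, introduced only for illustration. On a GPU, each of the three loops would typically become a data-parallel step, with the scan performed hierarchically across thread blocks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): returns the indices of the
// elements of `input` that satisfy `pred`, in their original relative order.
template <typename T, typename Pred>
std::vector<std::size_t> compact_indices(const std::vector<T>& input, Pred pred) {
    const std::size_t n = input.size();

    // Step 1: flag each element (1 = keep, 0 = discard).
    std::vector<std::size_t> flags(n);
    for (std::size_t i = 0; i < n; ++i) {
        flags[i] = pred(input[i]) ? 1 : 0;
    }

    // Step 2: exclusive prefix sum over the flags. offsets[i] is the
    // number of kept elements before position i, i.e. the output slot
    // of element i if it is kept.
    std::vector<std::size_t> offsets(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[i] = running;
        running += flags[i];
    }

    // Step 3: scatter the index of every kept element to its slot.
    // `running` now holds the total number of kept elements.
    std::vector<std::size_t> out(running);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            out[offsets[i]] = i;
        }
    }
    return out;
}
```

For example, calling `compact_indices` on `{3, -1, 4, -1, 5, -9, 2}` with a predicate that keeps positive values yields the indices `0, 2, 4, 6`. An atomics-based variant would instead have each kept element claim its output slot by incrementing a shared counter, which avoids the scan but gives up the ordering guarantee.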