Fast Parallel Stream Compaction for IA-Based Multi/many-core Processors

Stream compaction, found in a wide variety of applications, is a general primitive that reduces an input stream to the subset containing only the wanted elements, so that follow-on computation can be done efficiently. In this paper, we propose a fast parallel stream compaction algorithm for IA-based multi-/many-core processors. Unlike previously studied algorithms, which depend heavily on a black-box parallel scan, the proposed algorithm opens the black box and tailors the scan manually so that both the workload and the memory footprint are significantly reduced. By further eliminating conditional statements and applying automatic code generation and optimization to performance-critical kernels, the proposed parallel stream compaction achieves high performance in different cases and for various data types across different IA-based multi-/many-core platforms. Experimental results on three typical IA-based processors (a quad-core Core-i7 CPU, a dual-socket 8-core Xeon CPU, and a 61-core Xeon Phi accelerator) show that the proposed implementation outperforms the reference parallel counterpart in the state-of-the-art library Thrust. In addition, we apply it to a random-forest-based data classifier to show its potential to boost the performance of real-world applications.
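To make the primitive concrete, the following is a minimal serial sketch in C++ of the two ideas the abstract mentions. It is not the proposed algorithm and is not taken from the paper. `compact_scan` illustrates the conventional flag, exclusive-scan, and scatter formulation that scan-based compaction builds on; `compact_branchless` illustrates one common way to avoid a conditional store, namely writing every element unconditionally and advancing the output index only when the predicate holds. Both function names and the predicate interface are assumptions introduced for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): scan-based compaction.
// Keeps the elements of `in` for which `pred` returns true, preserving order.
template <typename T, typename Pred>
std::vector<T> compact_scan(const std::vector<T>& in, Pred pred) {
    const std::size_t n = in.size();
    std::vector<std::size_t> flags(n), offsets(n);

    // Step 1: flag each wanted element with 1, each unwanted element with 0.
    for (std::size_t i = 0; i < n; ++i) {
        flags[i] = pred(in[i]) ? 1 : 0;
    }

    // Step 2: exclusive prefix sum of the flags gives each wanted
    // element its position in the output.
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[i] = running;
        running += flags[i];
    }

    // Step 3: scatter the wanted elements to their output positions.
    std::vector<T> out(running);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            out[offsets[i]] = in[i];
        }
    }
    return out;
}

// Hypothetical helper (not from the paper): compaction without a
// conditional store. Every element is written at the current output
// index; the index advances only when the predicate holds, so an
// unwanted element is overwritten by the next write.
template <typename T, typename Pred>
std::vector<T> compact_branchless(const std::vector<T>& in, Pred pred) {
    std::vector<T> out(in.size());
    std::size_t k = 0;  // k never exceeds the number of elements already visited
    for (const T& x : in) {
        out[k] = x;
        k += static_cast<std::size_t>(pred(x) ? 1 : 0);
    }
    out.resize(k);
    return out;
}
```

For example, applying either function to the input 3, 8, 1, 6, 4, 7 with an "is even" predicate yields 8, 6, 4. The scan-based version needs two auxiliary arrays of the input's length, which hints at why reducing the workload and memory footprint of the scan matters; in a parallel setting each thread would typically process one chunk and the scan would run over per-chunk counts.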
