Single-pass Parallel Prefix Scan with Decoupled Lookback

We describe a work-efficient, communication-avoiding, single-pass method for the parallel computation of prefix scan. When consuming input from memory, our algorithm requires only ~2n data movement: n inputs are read, n outputs are written. Our method embodies a decoupled look-back strategy that performs redundant work to dissociate local computation from the latencies of global prefix propagation. As implemented in the CUB library of parallel primitives for GPU architectures, the throughput of our parallel prefix scan approaches that of copy operations. Furthermore, the single-pass nature of our method allows it to be adapted for (1) in-place compaction behavior, and (2) in-situ global allocation within computations that oversubscribe the processor.
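The decoupled look-back idea can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the CUB implementation: the function name `decoupled_lookback_scan`, the status labels `X`/`A`/`P`, the tile size, and the shuffled processing order are all hypothetical choices made for illustration. Each tile first publishes its local aggregate. It then walks backward over its predecessors, summing their aggregates (the redundant work) until it meets a tile whose full inclusive prefix is already published.

```python
import random

# Hypothetical status flags for a tile's descriptor (names are assumptions).
INVALID, AGGREGATE, PREFIX = "X", "A", "P"


def decoupled_lookback_scan(data, tile_size=4, seed=0):
    """Inclusive prefix sum via a sequential simulation of tile look-back."""
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    n_tiles = len(tiles)
    status = [INVALID] * n_tiles
    aggregate = [0] * n_tiles          # sum of the tile's own elements
    inclusive_prefix = [0] * n_tiles   # sum of everything up to and including the tile

    # Step 1: every tile reduces its own elements and publishes the result.
    # Tile 0 has no predecessors, so its aggregate is already its prefix.
    for t, tile in enumerate(tiles):
        aggregate[t] = sum(tile)
        if t == 0:
            inclusive_prefix[0] = aggregate[0]
            status[0] = PREFIX
        else:
            status[t] = AGGREGATE

    # Step 2: tiles resolve their prefixes in an arbitrary order, standing in
    # for unpredictable scheduling. On a real GPU a tile would spin while a
    # predecessor is still INVALID; here step 1 has already finished.
    order = list(range(n_tiles))
    random.Random(seed).shuffle(order)
    out = [None] * len(data)
    for t in order:
        exclusive = 0
        p = t - 1
        while p >= 0:
            if status[p] == PREFIX:
                exclusive += inclusive_prefix[p]  # full prefix found: stop
                break
            exclusive += aggregate[p]             # redundant work: keep looking back
            p -= 1
        inclusive_prefix[t] = exclusive + aggregate[t]
        status[t] = PREFIX

        # Local scan seeded with the exclusive prefix. In a real kernel the
        # tile would still be resident in registers or shared memory, so it
        # is read from global memory only once.
        running = exclusive
        for i, x in enumerate(tiles[t]):
            running += x
            out[t * tile_size + i] = running
    return out
```

Calling `decoupled_lookback_scan(list(range(1, 11)), tile_size=3)` should give the same result as `itertools.accumulate` over the same input, whatever the shuffle order, because a tile's look-back depends only on descriptors that are already published.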
