Scheduling Irregular Dataflow Pipelines on SIMD Architectures

Streaming computations often exhibit substantial data parallelism that makes them well-suited to SIMD architectures. However, many such computations also exhibit irregularity, in the form of data-dependent, dynamic data rates, that makes efficient SIMD execution challenging. One aspect of this challenge is the need to schedule execution of a computation realized as a pipeline of stages connected by finite queues. A scheduler must both ensure high SIMD occupancy by gathering queued items into vectors and minimize the costs of switching execution between stages. In this work, we present the AFIE (Active Full, Inactive Empty) scheduling policy for irregular streaming applications on SIMD processors. AFIE provably groups the inputs to each stage of a pipeline into a minimal number of SIMD vectors while incurring a number of switches that is bounded relative to the best possible policy. These results hold even though irregularity rules out a priori knowledge of how many outputs each input to each stage will generate. We have implemented AFIE as an extension to the MERCATOR system [8] for building irregular streaming applications on NVIDIA GPUs. We describe how the AFIE scheduler simplifies MERCATOR's runtime code and empirically measure the new scheduler's improved performance on irregular streaming applications.

[1] Martin Roesch, et al. Snort - Lightweight Intrusion Detection for Networks, 1999.

[2] Paul-Louis George, et al. Delaunay triangulation and meshing: application to finite elements, 1998.

[3] William J. Dally, et al. Programmable Stream Processors, 2003, Computer.

[4] Piet Hut, et al. A hierarchical O(N log N) force-calculation algorithm, 1986, Nature.

[5] William Thies, et al. StreamIt: A Language for Streaming Applications, 2002, CC.

[6] S. R. Sathe, et al. Solving N-Queens problem on GPU architecture using OpenCL with special reference to synchronization issues, 2012, 2012 2nd IEEE International Conference on Parallel, Distributed and Grid Computing.

[7] E.A. Lee, et al. Synchronous data flow, 1987, Proceedings of the IEEE.

[8] Jeremy Buhler, et al. MERCATOR: A GPGPU Framework for Irregular Streaming Applications, 2017, 2017 International Conference on High Performance Computing & Simulation (HPCS).

[9] James R. Larus, et al. SIMD parallelization of applications that traverse irregular data structures, 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[10] Keshav Pingali, et al. A quantitative study of irregular programs on GPUs, 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[11] E. Myers, et al. Basic local alignment search tool, 1990, Journal of molecular biology.

[12] Jeremy T. Fineman, et al. Cache-conscious scheduling of streaming pipelines on parallel machines with private caches, 2014, 2014 21st International Conference on High Performance Computing (HiPC).

[13] Nikolaos V. Sahinidis, et al. GPU-BLAST: using graphics processors to accelerate protein sequence alignment, 2010, Bioinform.

[14] Gary L. Miller, et al. Sparse parallel Delaunay mesh refinement, 2007, SPAA '07.

[15] Paul A. Viola, et al. Rapid object detection using a boosted cascade of simple features, 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[16] Peng Li, et al. Deadlock avoidance for streaming computations with filtering, 2010, SPAA '10.

[17] Pat Hanrahan, et al. Brook for GPUs: stream computing on graphics hardware, 2004, ACM Trans. Graph.

[18] Mark A. Franklin, et al. Acceleration of atmospheric Cherenkov telescope signal processing to real-time speed with the Auto-Pipe design system, 2008.