VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU

Pipelining is an important programming pattern, but GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper systematically examines the existing pipeline execution models on GPUs and analyzes their strengths and weaknesses. To address their shortcomings, it proposes three new execution models with much improved controllability, including a hybrid model that combines the strengths of all the models. These insights lead to the development of a software programming framework named VersaPipe. With VersaPipe, users only need to write the operations for each pipeline stage; VersaPipe then automatically assembles the stages into a hybrid execution model and configures it to achieve the best performance. Experiments on a set of pipeline benchmarks and a real-world face detection application show that VersaPipe produces speedups of up to $6.90\times$ ($2.88\times$ on average) over the original manual implementations.

CCS CONCEPTS • General and reference $\rightarrow$ Performance; • Computing methodologies $\rightarrow$ Parallel computing methodologies; • Computer systems organization $\rightarrow$ Heterogeneous (hybrid) systems.
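The programming model described in the abstract, where the user writes only the per-stage operations and a framework assembles and runs them, can be illustrated with a minimal sketch. This is not VersaPipe's API: the `Pipeline` class, the `stage` decorator, the reserved `"out"` sink name, and the two example stages are all hypothetical, and the scheduler below runs sequentially on the CPU, so it models none of the GPU execution models the paper studies. It only shows the division of labor between stage code and framework code.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical stage signature: consume one item and emit zero or more
# (target_stage_name, item) pairs. The name "out" is a reserved sink.
StageFn = Callable[[object], Iterable[Tuple[str, object]]]


class Pipeline:
    """Toy stand-in for a pipeline framework (an assumption, not VersaPipe)."""

    def __init__(self) -> None:
        self.stages: Dict[str, StageFn] = {}
        self.queues: Dict[str, deque] = {}
        self.outputs: List[object] = []

    def stage(self, name: str):
        """Register a user-written stage function under a name."""
        def register(fn: StageFn) -> StageFn:
            self.stages[name] = fn
            self.queues[name] = deque()
            return fn
        return register

    def run(self, first_stage: str, items: Iterable[object]) -> List[object]:
        """Drain every stage queue until no work remains (sequential)."""
        self.queues[first_stage].extend(items)
        while any(self.queues.values()):
            for name, queue in self.queues.items():
                while queue:
                    for target, out in self.stages[name](queue.popleft()):
                        if target == "out":
                            self.outputs.append(out)
                        else:
                            self.queues[target].append(out)
        return self.outputs


pipe = Pipeline()


@pipe.stage("scale")
def scale(x):
    # One-to-many stage: each input produces two work items.
    yield ("filter", x)
    yield ("filter", x * 2)


@pipe.stage("filter")
def keep_multiples_of_three(x):
    # One-to-zero-or-one stage: rejected items simply vanish.
    if x % 3 == 0:
        yield ("out", x)


result = pipe.run("scale", [3, 4, 6])
print(result)  # [3, 6, 6, 12]
```

The user-written part is limited to the two decorated functions; queueing and scheduling live in `Pipeline.run`. In a real GPU framework that scheduling policy is where the choice of execution model would matter.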
