The Best of Both Worlds: Combining CUDA Graph with an Image Processing DSL

CUDA graph is an asynchronous task-graph programming model recently released by Nvidia. It encapsulates an application workflow in a graph whose nodes are operations and whose edges are the dependencies between them. The new API brings two benefits: reduced work launch overhead and whole-workflow optimizations. In this paper, we improve the ability of CUDA graph to exploit workflow optimizations, e.g., concurrent execution of kernels with complementary resource occupancy. Additionally, we argue that the advantages of DSLs are complementary to those of CUDA graph, and that combining the two techniques yields the best of both worlds. We propose a compiler-based approach that combines CUDA graph with Hipacc, an image processing DSL and source-to-source compiler. For ten image processing applications benchmarked on two Nvidia GPUs, our approach achieves a geometric mean speedup of 1.30 over Hipacc without CUDA graph, 1.11 over CUDA graph without Hipacc, and 3.96 over Halide, another state-of-the-art DSL.
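To make the task-graph idea concrete, the following is a minimal sketch in plain Python of a workflow expressed as nodes connected by dependencies. It is only a model of the concept: it uses neither the CUDA graph API nor Hipacc, and the function name `concurrency_levels` and the kernel names (`load`, `blur_x`, `blur_y`, `combine`, `store`) are hypothetical, chosen for illustration. The sketch groups nodes into levels so that nodes within one level have no dependency path between them and are therefore candidates for concurrent execution.

```python
def concurrency_levels(deps):
    """Group task-graph nodes into dependency levels.

    deps maps each node to the list of nodes it depends on.
    Nodes in the same returned level do not depend on each other,
    so a scheduler would be free to run them concurrently.
    """
    # Collect every node, including ones that only appear as a dependency.
    nodes = set(deps)
    for parents in deps.values():
        nodes.update(parents)

    # Unsatisfied dependencies per node.
    remaining = {n: set(deps.get(n, ())) for n in nodes}

    levels = []
    while remaining:
        # A node is ready once all of its dependencies have been scheduled.
        ready = sorted(n for n, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("dependency cycle: not a valid task graph")
        levels.append(ready)
        for n in ready:
            del remaining[n]
        for parents in remaining.values():
            parents.difference_update(ready)
    return levels


# Hypothetical image pipeline: two independent filters feed one combine step.
pipeline = {
    "blur_x": ["load"],
    "blur_y": ["load"],
    "combine": ["blur_x", "blur_y"],
    "store": ["combine"],
}

for depth, level in enumerate(concurrency_levels(pipeline)):
    print(depth, level)
```

In this hypothetical pipeline, `blur_x` and `blur_y` land in the same level. That is the kind of opportunity the abstract refers to as concurrent kernel execution; whether two such kernels actually overlap on a GPU depends on their resource occupancy, which this sketch does not model.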
