Unveiling kernel concurrency in multiresolution filters on GPUs with an image processing DSL

Multiresolution filters, which analyze information at different scales, are crucial for many applications in digital image processing. The differing space and time complexity across the levels of their pyramidal structure poses both a challenge and an opportunity for implementations on modern accelerators such as GPUs, which offer an increasing number of compute units. In this paper, we exploit the potential of concurrent kernel execution in multiresolution filters. As a major contribution, we present a model-based approach for analyzing the performance of both single- and multi-stream implementations, combining application- and architecture-specific knowledge. As a second contribution, the required transformations and code generators, which use CUDA streams on Nvidia GPUs, have been integrated into a compiler-based approach built on an image processing DSL called Hipacc. We then apply our approach to evaluate and compare the achieved performance for four real-world applications on three GPUs. The results show that our method can achieve a geometric mean speedup of up to 2.5 over the original Hipacc implementation, up to 2.0 over the state-of-the-art DSL Halide, and up to 1.3 over CUDA Graph, the recently released programming model from Nvidia.
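
To illustrate why the pyramidal structure leaves room for concurrent kernels, the following minimal sketch counts the thread blocks a per-pixel kernel would launch at each pyramid level and flags the levels that cannot fill the GPU on their own. It is not the performance model of the paper. The function names, the 32x4 block shape, the halving of each dimension per level, and the capacity of 640 resident blocks are all illustrative assumptions.

```python
import math

def pyramid_levels(width, height, depth):
    """Return (width, height) for each level, assuming each dimension halves per level."""
    levels = []
    for _ in range(depth):
        levels.append((width, height))
        width = max(1, math.ceil(width / 2))
        height = max(1, math.ceil(height / 2))
    return levels

def blocks_per_level(levels, block_x=32, block_y=4):
    """Thread blocks launched per level for a one-thread-per-pixel kernel (assumed block shape)."""
    return [math.ceil(w / block_x) * math.ceil(h / block_y) for w, h in levels]

def underutilized_levels(block_counts, resident_blocks):
    """Indices of levels whose kernel launches fewer blocks than the GPU can hold at once."""
    return [i for i, b in enumerate(block_counts) if b < resident_blocks]

# Hypothetical setup: 2048x2048 image, 6 levels, GPU holding 640 resident blocks.
levels = pyramid_levels(2048, 2048, 6)
blocks = blocks_per_level(levels)
idle = underutilized_levels(blocks, resident_blocks=640)

for i, ((w, h), b) in enumerate(zip(levels, blocks)):
    note = "underfills GPU" if i in idle else "fills GPU"
    print(f"level {i}: {w}x{h}, {b} blocks, {note}")
```

Under these assumptions, the coarse levels launch too few blocks to occupy all compute units when run one after another in a single stream, which is the situation in which issuing independent kernels on separate CUDA streams may help.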
