HipaccVX: wedding of OpenVX and DSL-based code generation

Writing high-performance programs for heterogeneous platforms is hard because the code must be tuned at a low level with architecture-specific optimizations, which are often based on fundamentally different programming paradigms and languages. OpenVX promises to solve this issue for computer vision applications with a royalty-free industry standard that is based on a graph-execution model. Yet OpenVX's algorithm space is constrained to a small set of vision functions, which hinders accelerating computations that are not included in the standard. In this paper, we analyze OpenVX vision functions to find an orthogonal set of computational abstractions. Based on these abstractions, we couple an existing domain-specific language (DSL) back end to the OpenVX environment and provide language constructs to the programmer for the definition of user-defined nodes. In this way, we enable optimizations that OpenVX graph implementations restricted to the standard computer vision functions cannot detect. These optimizations can double the throughput on an Nvidia GTX GPU and decrease the resource usage of a Xilinx Zynq FPGA by 50% for our benchmarks. Finally, we show that our proposed compiler framework, called HipaccVX, can achieve better results than the state-of-the-art approaches Nvidia VisionWorks and Halide-HLS.
