OpenVX Graph Optimization for Visual Processor Units

OpenVX is a standardized, cross-platform software framework for developing accelerated computer vision, machine learning, and other signal processing applications. Designed for performance optimization, OpenVX lets the programmer define an application using a graph-based programming model, in which the nodes are selected from a repertoire of predefined kernels and the edges represent the flow of successive images between pairs of kernels. The graph-based representation exposes spatial and temporal concurrency and gives the managing runtime library opportunities for tuning. In this paper, we present a performance-model-based approach for optimizing the execution of OpenVX graphs on the Texas Instruments C66x Digital Signal Processor (DSP), which has characteristics similar to those of other widespread DSPs such as the Qualcomm Hexagon, Nvidia Programmable Vision Accelerator, and Google Pixel Visual Core. Our approach involves training performance models to predict the impact of tile size and node merging on performance and DRAM utilization. We evaluate our models against randomly generated, valid, and executable OpenVX graphs.
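
To make the interaction between tile size, node merging, and DRAM traffic concrete, the following is a minimal analytic sketch. It is not the trained performance model described above. It assumes a linear chain of stencil kernels, a software-managed scratchpad, and a fixed halo per tile; all function names, parameters, and the example sizes are hypothetical and chosen only for illustration.

```python
from math import ceil

def dram_bytes_unmerged(width, height, num_nodes, bytes_per_pixel=1):
    """Unmerged chain: every node reads its full input image from DRAM
    and writes its full output image back (halo traffic ignored)."""
    image_bytes = width * height * bytes_per_pixel
    return 2 * image_bytes * num_nodes

def dram_bytes_merged(width, height, tile_w, tile_h, halo, bytes_per_pixel=1):
    """Merged chain: intermediates stay on chip, so each tile costs one
    DRAM read (tile plus halo border) and one DRAM write (tile only)."""
    tiles = ceil(width / tile_w) * ceil(height / tile_h)
    read_per_tile = (tile_w + 2 * halo) * (tile_h + 2 * halo) * bytes_per_pixel
    write_per_tile = tile_w * tile_h * bytes_per_pixel
    return tiles * (read_per_tile + write_per_tile)

def fits_scratchpad(tile_w, tile_h, halo, scratch_bytes, buffers=2,
                    bytes_per_pixel=1):
    """Assumed working set: `buffers` haloed tiles resident at once."""
    tile_bytes = (tile_w + 2 * halo) * (tile_h + 2 * halo) * bytes_per_pixel
    return buffers * tile_bytes <= scratch_bytes

def best_tile(width, height, halo, scratch_bytes, candidates):
    """Pick the candidate tile that fits on chip and minimizes DRAM bytes.
    Returns None when no candidate fits."""
    feasible = [t for t in candidates
                if fits_scratchpad(t[0], t[1], halo, scratch_bytes)]
    if not feasible:
        return None
    return min(feasible,
               key=lambda t: dram_bytes_merged(width, height, t[0], t[1], halo))

if __name__ == "__main__":
    # Hypothetical 640x480 8-bit image, 3-node chain, 3-pixel halo, 32 KiB scratchpad.
    candidates = [(32, 32), (64, 48), (128, 96)]
    print("unmerged:", dram_bytes_unmerged(640, 480, num_nodes=3))
    for tw, th in candidates:
        print((tw, th), dram_bytes_merged(640, 480, tw, th, halo=3))
    print("best tile:", best_tile(640, 480, 3, 32 * 1024, candidates))
```

In this toy model, larger tiles amortize the halo overhead but must still fit in on-chip memory, while merging removes the DRAM round trips for intermediate images. A learned model would replace these closed-form estimates with predictions fitted to measurements on the target DSP.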
