JANUS: A Compilation System for Balancing Parallelism and Performance in OpenVX

Embedded systems typically lack enough on-chip memory to hold an entire image buffer. Programming systems such as OpenCV operate on whole image frames at each step, consuming excessive memory bandwidth and power. In contrast, the paradigm used by OpenVX is far more efficient: it employs image tiling, and the compilation system may analyze and optimize the operation sequence, specified as a compute graph, before any pixel processing begins. In this work, we build a compilation system for OpenVX that analyzes and optimizes the compute graph to exploit the parallel resources of many-core systems or FPGAs. Using a database of prewritten OpenVX kernels, it automatically adjusts the image tile size and applies kernel duplication and coalescing to meet a specified area (resource) target or a specified throughput target. A single compute graph can thus target implementations with a wide range of performance needs and capabilities, from handheld devices to datacenters, while using minimal resources and power to reach the performance target.
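To make the space/time tradeoff concrete, the sketch below models one simple strategy in the spirit of what the abstract describes: replicating each kernel in a streaming pipeline until the slowest stage meets a throughput target, then checking the result against an area budget. This is an illustrative toy model, not JANUS's actual algorithm; the kernel names, per-kernel rates, and area costs are all hypothetical.

```python
# Illustrative sketch (NOT the paper's algorithm): greedy kernel replication
# to meet a throughput target under an area budget, in the spirit of the
# space/time scaling JANUS performs on OpenVX compute graphs.
# Kernel names, rates, and area costs below are hypothetical placeholders.
from math import ceil

# kernel name -> (pixels/cycle for one instance, area units per instance)
KERNELS = {
    "gaussian3x3": (0.25, 120),
    "sobel3x3":    (0.50, 150),
    "threshold":   (1.00, 40),
}

def replicate_for_throughput(kernels, target, area_budget):
    """Return a replica count per kernel so every pipeline stage sustains
    at least `target` pixels/cycle, or None if that exceeds the budget."""
    plan, area = {}, 0
    for name, (rate, cost) in kernels.items():
        replicas = ceil(target / rate)   # copies needed for this stage
        plan[name] = replicas
        area += replicas * cost
    return plan if area <= area_budget else None

# Meeting 1 pixel/cycle fits in 1000 area units; 4 pixels/cycle does not.
plan = replicate_for_throughput(KERNELS, target=1.0, area_budget=1000)
```

In a real system the tile size would also vary, and coalescing would merge lightly loaded kernels to reclaim area; here only replication is modeled.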
