Hardware design and analysis of efficient loop coarsening and border handling for image processing

Field-Programmable Gate Arrays (FPGAs) excel at implementing local operators in terms of throughput per energy, since off-chip communication can be reduced with an application-specific on-chip memory configuration. Furthermore, data-level parallelism can be exploited efficiently through so-called loop coarsening, which processes multiple horizontally adjacent pixels simultaneously. However, existing solutions for proper border handling in hardware incur considerable resource overheads. In this paper, we first propose novel architectures for image border handling and loop coarsening, which can significantly reduce area. Second, we present a systematic analysis of these architectures, including the formulation of analytical models for their area usage. Based on these models, we provide an algorithm that suggests the most efficient hardware architecture for a given specification. Finally, we evaluate several implementations of our proposed architectures obtained through Vivado High-Level Synthesis (HLS). The synthesis results show that the proposed coarsening architecture uses 32% fewer registers than previous works for a 5-by-5 convolution with a coarsening factor of 64, whereas the proposed border handling architectures reduce Look-up Table (LUT) usage by 36%.
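To make the two notions in the abstract concrete, the following is a minimal software sketch in C++ and not the architecture proposed in the paper. It assumes a 1-D, 3-tap sum filter and a clamp (edge-repeat) border mode. The names `clamp_index`, `box3_scalar`, and `box3_coarsened` are hypothetical and chosen only for illustration. The coarsened version computes `V` horizontally adjacent outputs per outer-loop iteration, which an HLS tool could map to `V` parallel datapaths if the inner loop were fully unrolled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Clamp border handling (assumed mode): coordinates outside the row
// are mapped to the nearest valid pixel, so the edge pixel is repeated.
static int clamp_index(int x, int width) {
    assert(width > 0);
    return std::min(std::max(x, 0), width - 1);
}

// Reference version: one output pixel per loop iteration.
std::vector<int> box3_scalar(const std::vector<int>& in) {
    const int w = static_cast<int>(in.size());
    std::vector<int> out(w);
    for (int x = 0; x < w; ++x) {
        out[x] = in[clamp_index(x - 1, w)] + in[x] + in[clamp_index(x + 1, w)];
    }
    return out;
}

// Coarsened version: V horizontally adjacent output pixels per outer
// iteration. In hardware the inner loop would be fully unrolled; here
// it is an ordinary loop, so the result matches the reference exactly.
template <int V>
std::vector<int> box3_coarsened(const std::vector<int>& in) {
    static_assert(V > 0, "coarsening factor must be positive");
    const int w = static_cast<int>(in.size());
    std::vector<int> out(w);
    for (int x0 = 0; x0 < w; x0 += V) {
        for (int v = 0; v < V; ++v) {
            const int x = x0 + v;
            if (x >= w) break;  // row width need not be a multiple of V
            out[x] = in[clamp_index(x - 1, w)] + in[x] + in[clamp_index(x + 1, w)];
        }
    }
    return out;
}
```

In this sketch every one of the `V` lanes carries its own border check. Per-lane logic of this kind is the sort of overhead that grows with the coarsening factor, which is why the area cost of border handling matters for large factors such as 64.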
