Loop coarsening in C-based High-Level Synthesis

Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP), the support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines, consisting of point and local operators. In addition to well known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows to process multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc by loop coarsening and compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from the exact same code base. Moreover, we demonstrate the advantages of code generation for algorithm development by outlining how design space exploration enabled by HIPAcc can yield a more efficient implementation than hand-coded VHDL.

[1]  Marc Reichenbach,et al.  A Generic VHDL Template for 2D Stencil Code Applications on FPGAs , 2012, 2012 IEEE 15th International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing Workshops.

[2]  Martin Odersky,et al.  Making domain-specific hardware synthesis tools cost-efficient , 2013, 2013 International Conference on Field-Programmable Technology (FPT).

[3]  Dejan Markovic,et al.  27.5 A multi-granularity FPGA with hierarchical interconnects for efficient and flexible mobile computing , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[4]  Adrian Park,et al.  Designing Modular Hardware Accelerators in C with ROCCC 2.0 , 2010, 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines.

[5]  David Padua,et al.  Encyclopedia of Parallel Computing , 2011 .

[6]  Donald G. Bailey,et al.  Design for Embedded Image Processing on FPGAs: Bailey/Design for Embedded Image Processing on FPGAs , 2011 .

[7]  Stephen Dean Brown,et al.  Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs , 2013, TRETS.

[8]  Jason Helge Anderson,et al.  From software threads to parallel hardware in high-level synthesis for FPGAs , 2013, 2013 International Conference on Field-Programmable Technology (FPT).

[9]  Jason Helge Anderson,et al.  LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems , 2013, TECS.

[10]  Jürgen Teich,et al.  Automatic Optimization of Hardware Accelerators for Image Processing , 2015, ArXiv.

[11]  Anil K. Jain,et al.  Computer Vision Algorithms on Reconfigurable Logic Arrays , 1999, IEEE Trans. Parallel Distributed Syst..

[12]  Monica S. Lam,et al.  RETROSPECTIVE : Software Pipelining : An Effective Scheduling Technique for VLIW Machines , 1998 .

[13]  Donald G. Bailey,et al.  Design for Embedded Image Processing on FPGAs , 2011 .

[14]  G. Amdhal,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[15]  Sang-Yong Han,et al.  Exploiting Spatial and Temporal Parallelism in the Multithreaded Node Architecture Implemented on Superscalar RISC Processors , 1993, 1993 International Conference on Parallel Processing - ICPP'93.

[16]  Jürgen Teich,et al.  Code generation from a domain-specific language for C-based HLS of hardware accelerators , 2014, 2014 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[17]  Pat Hanrahan,et al.  Darkroom , 2014, ACM Trans. Graph..

[18]  Jürgen Teich,et al.  PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications , 2008, ARC.

[19]  Kazutoshi Wakabayashi,et al.  C-based SoC design flow and EDA tools: an ASIC and system vendorperspective , 2000, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[20]  Michael Wolfe,et al.  More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[21]  Muhsen Owaida,et al.  Synthesis of Platform Architectures from OpenCL Programs , 2011, 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines.