Loop Parallelization Techniques for FPGA Accelerator Synthesis

Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). In contrast, their support for Data-Level Parallelism (DLP), one of the key advantages of Field-Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to the well-known technique of loop tiling, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires generating glue logic that distributes the image data. Loop coarsening, by contrast, processes multiple pixels in parallel, replicating only the kernel operator within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL and compare them to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
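The structural difference between tiling and coarsening can be illustrated with a minimal C sketch. It is not the code generated by HIPAcc. The image dimensions, the constants TILES and VECT, the stand-in point operator kernel_op, and all function names are assumptions made for illustration only. Vendor-specific directives are deliberately left out, so the replication is expressed as plain loops that an HLS tool would have to be instructed to instantiate in parallel.

```c
#include <assert.h>
#include <stdint.h>

#define WIDTH  16  /* hypothetical image width                        */
#define HEIGHT 4   /* hypothetical image height                       */
#define TILES  2   /* assumed number of replicated accelerators       */
#define VECT   4   /* assumed number of pixels handled per iteration  */

/* Stand-in kernel operator: scale brightness by 2 and clamp to 255. */
static uint8_t kernel_op(uint8_t p) {
    unsigned v = (unsigned)p * 2u;
    return (uint8_t)(v > 255u ? 255u : v);
}

/* Baseline: one pixel per loop iteration. */
void filter_baseline(const uint8_t *in, uint8_t *out) {
    for (int i = 0; i < WIDTH * HEIGHT; ++i)
        out[i] = kernel_op(in[i]);
}

/* One complete accelerator: its own loop and control over n pixels. */
static void accelerator(const uint8_t *in, uint8_t *out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = kernel_op(in[i]);
}

/* Loop tiling: the image is split into TILES regions of whole rows and
 * the entire accelerator is replicated once per region. The pointer
 * offsets play the role of the glue logic that distributes the data. */
void filter_tiled(const uint8_t *in, uint8_t *out) {
    assert(HEIGHT % TILES == 0);
    const int tile_pixels = (HEIGHT / TILES) * WIDTH;
    for (int t = 0; t < TILES; ++t)  /* in hardware: TILES parallel instances */
        accelerator(in + t * tile_pixels, out + t * tile_pixels, tile_pixels);
}

/* Loop coarsening: a single accelerator whose loop body handles VECT
 * pixels, so only kernel_op is replicated, not the loop control. */
void filter_coarsened(const uint8_t *in, uint8_t *out) {
    assert((WIDTH * HEIGHT) % VECT == 0);
    for (int i = 0; i < WIDTH * HEIGHT; i += VECT)
        for (int v = 0; v < VECT; ++v)  /* fixed trip count: candidate for full unrolling */
            out[i + v] = kernel_op(in[i + v]);
}
```

All three functions produce the same output for the same input; they differ only in what would be replicated in hardware. A local operator such as a convolution would additionally need border pixels shared between neighbouring tiles, which this point-operator sketch leaves out.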
