Programming and Synthesis for Software-defined FPGA Acceleration: Status and Future Prospects

FPGA-based accelerators are increasingly popular across a broad range of applications because they offer massive parallelism, high energy efficiency, and great flexibility for customization. However, difficulties in programming and integrating FPGAs have hindered their widespread adoption. Since the mid-2000s, there has been extensive research and development toward making FPGAs accessible to software-inclined developers, not only hardware specialists. Many programming models and automated synthesis tools, such as high-level synthesis, have been proposed to tackle this grand challenge. In this survey, we describe the progress to date and the future prospects of the ongoing effort to significantly improve the software programmability of FPGAs. We first provide a taxonomy of the essential techniques for building a high-performance FPGA accelerator, which requires customization of the compute engines, memory hierarchy, and data representations. We then summarize a rich spectrum of work on programming abstractions and optimizing compilers that offer different trade-offs between performance and productivity. Finally, we highlight several additional challenges and opportunities that deserve more attention from the community in order to bring FPGA-based computing to the masses.
