Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency

Designs generated by high-level synthesis (HLS) tools typically achieve a lower clock frequency than manually written RTL designs. In this work, we study the timing issues in a diverse set of realistic and complex FPGA HLS designs. (1) We observe that in almost all cases the frequency degradation is caused by broadcast structures generated by the HLS compiler. (2) We identify three major types of broadcasts in HLS-generated designs: high-fanout data signals, pipeline flow-control signals, and synchronization signals for concurrent modules. (3) We reveal a number of limitations of current HLS tools that result in these broadcast-related timing issues. (4) We propose a set of effective yet easy-to-implement approaches, including broadcast-aware scheduling, synchronization pruning, and skid-buffer-based flow control. Our experimental results show that these methods improve the maximum frequency of a set of nine representative HLS benchmarks by 53% on average. In some cases, the frequency gain exceeds 100 MHz.
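To make the idea of skid-buffer-based flow control concrete, the following is a minimal cycle-level behavioral sketch of a generic two-entry skid buffer. It is not the implementation evaluated in this work; the names `SkidBuffer`, `step`, and `run` are hypothetical and chosen only for illustration. The point it illustrates is that the upstream-facing ready signal depends only on registered state, so a downstream stall does not have to reach every upstream stage within the same cycle.

```python
class SkidBuffer:
    """Behavioral model of a generic two-entry skid buffer (illustrative only)."""

    def __init__(self):
        self.main = None  # output register, visible to the downstream stage
        self.skid = None  # extra register that absorbs one in-flight item

    @property
    def in_ready(self):
        # Depends only on stored state, not on this cycle's out_ready.
        return self.skid is None

    @property
    def out_valid(self):
        return self.main is not None

    def step(self, in_valid, in_data, out_ready):
        """Advance one clock cycle; returns (accepted, fired, data_out)."""
        # Handshakes are decided from the state before the clock edge.
        accepted = bool(in_valid) and self.in_ready
        fired = self.out_valid and bool(out_ready)
        data_out = self.main if fired else None
        if fired:
            # The output register drains and is refilled from the skid register.
            self.main = self.skid
            self.skid = None
        if accepted:
            if self.main is None:
                self.main = in_data
            else:
                self.skid = in_data
        return accepted, fired, data_out


def run(items, ready_pattern):
    """Push items through one buffer; ready_pattern must contain a True."""
    buf = SkidBuffer()
    pending = list(items)
    received = []
    cycle = 0
    while len(received) < len(items):
        out_ready = ready_pattern[cycle % len(ready_pattern)]
        in_data = pending[0] if pending else None
        accepted, fired, data = buf.step(bool(pending), in_data, out_ready)
        if accepted:
            pending.pop(0)
        if fired:
            received.append(data)
        cycle += 1
    return received
```

In this sketch, a stall from downstream is seen by the producer one cycle late, and the skid register holds the single item that arrives during that cycle, so no data is lost or reordered.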
