Extending High-Level Synthesis for Task-Parallel Programs

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of result (QoR) and short development cycle compared with the traditional register-transfer level (RTL) design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools support task-parallel programs, productivity is greatly limited in the code development, correctness verification, and QoR tuning cycles, due to poor programmability, restricted software simulation, and slow code generation, respectively. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, universal software simulation, and fast code generation to overcome these limitations. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves the programmability. The correctness verification and the iterative QoR tuning cycles are accelerated by 3.2× and 6.8×, respectively.
