ScaleHLS: Scalable High-Level Synthesis through MLIR

High-level synthesis (HLS) has been widely adopted as it significantly improves hardware design productivity and enables efficient design space exploration (DSE). HLS tools can be applied to many different kinds of design problems, which are often better solved at different levels of abstraction. While existing HLS tools are built on compiler infrastructures largely based on a single level of abstraction (e.g., LLVM), we propose ScaleHLS, a next-generation HLS compilation flow built, for the first time, on top of the multi-level compiler infrastructure MLIR. By using an intermediate representation (IR) that can be tuned to particular algorithms at different representation levels, ScaleHLS is more scalable and customizable toward applications that come with intrinsic structural or functional hierarchies. ScaleHLS can represent and optimize HLS designs at multiple levels of abstraction, and it provides an HLS-dedicated transform and analysis library that solves each optimization problem at the suitable representation level. On top of this library, we build an automated DSE engine to explore the multi-dimensional design space efficiently. In addition, we develop an HLS C front-end and a C/C++ emission back-end that translate HLS designs into and out of MLIR, enabling an end-to-end ScaleHLS flow. Experimental results show that, compared to baseline designs optimized only by Xilinx Vivado HLS, ScaleHLS improves performance by up to 768.1× on computation-kernel-level programs and up to 3825.0× on neural network models.
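To make the automated DSE engine concrete, the sketch below shows the general shape of a multi-dimensional design space exploration over loop unroll factors and pipeline initiation intervals (II). The trip count, resource budget, and the analytical cost model are hypothetical placeholders for illustration only, not ScaleHLS internals.

```python
# Illustrative sketch of multi-dimensional HLS design space exploration.
# The cost model and constants below are hypothetical, not ScaleHLS's.
from itertools import product

TRIP_COUNT = 1024   # hypothetical loop trip count
DSP_BUDGET = 64     # hypothetical resource budget

def estimate(unroll, ii):
    """Toy analytical model: latency (cycles) and DSP usage of one loop nest."""
    latency = (TRIP_COUNT // unroll) * ii + 10  # +10 models pipeline depth
    dsps = unroll * 2                           # assume 2 DSPs per unrolled copy
    return latency, dsps

def explore(unroll_factors, iis):
    """Scan the (unroll, II) space; keep the fastest resource-feasible point."""
    best = None
    for unroll, ii in product(unroll_factors, iis):
        latency, dsps = estimate(unroll, ii)
        if dsps > DSP_BUDGET:
            continue  # prune design points that exceed the resource budget
        if best is None or latency < best[0]:
            best = (latency, (unroll, ii))
    return best

latency, point = explore([1, 2, 4, 8, 16, 32], [1, 2, 4])
```

A real engine replaces the toy `estimate` with analysis of the IR at the appropriate abstraction level and searches far larger spaces with pruning heuristics rather than exhaustive enumeration.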
