AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators

Adopting FPGAs as accelerators in datacenters is becoming mainstream for customized computing, but FPGAs are hard to program, which creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to restructure code manually and perform cumbersome parameter tuning to achieve optimal performance. Existing work has applied many learning models to automate the design of efficient accelerators, but the unpredictability of modern HLS tools makes it hard for these models to maintain high accuracy. We address this problem with AutoDSE, an automated design-space exploration (DSE) framework that uses a bottleneck-guided gradient optimizer to systematically find better design points. At each step, AutoDSE identifies the bottleneck of the current design and focuses on the high-impact parameters that can relieve it, much as an expert would. Experimental results show that AutoDSE finds design points that achieve a geometric-mean speedup of 19.9x over one CPU core on the MachSuite and Rodinia benchmarks, and of 1.04x over the manually designed HLS-accelerated vision kernels in the Xilinx Vitis libraries, while using 26x fewer optimization pragmas.
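The bottleneck-guided idea described above can be illustrated with a toy Python sketch. Everything in it is an assumption for illustration: the loop names, trip counts, candidate parallel factors, area budget, and the simple latency/area model are hypothetical stand-ins for what a real flow would read from HLS reports, and the greedy one-parameter step is a simplification, not AutoDSE's actual optimizer. At each step the sketch ranks loops by their latency contribution, tries to raise the parallel factor of the bottleneck loop first, and accepts a move only if latency drops and the design still fits the budget.

```python
# Hypothetical toy cost model (NOT an HLS tool): each loop's latency is its
# trip count divided by its parallel (unroll) factor; area is the sum of factors.
TRIP_COUNTS = {"load": 64, "compute": 1024, "store": 64}
CHOICES = [1, 2, 4, 8, 16, 32]  # hypothetical candidate parallel factors
AREA_BUDGET = 40                 # hypothetical resource limit


def loop_latency(name, factor):
    """Latency contribution of one loop under the toy model."""
    return TRIP_COUNTS[name] / factor


def evaluate(cfg):
    """Return (total latency, area) for a configuration {loop: factor}."""
    latency = sum(loop_latency(n, f) for n, f in cfg.items())
    area = sum(cfg.values())
    return latency, area


def bottleneck_guided_search(max_steps=50):
    cfg = {name: 1 for name in TRIP_COUNTS}  # start from the smallest design
    best_lat, _ = evaluate(cfg)
    for _ in range(max_steps):
        # 1) Rank loops so the current bottleneck (largest latency) comes first.
        ranked = sorted(cfg, key=lambda n: loop_latency(n, cfg[n]), reverse=True)
        improved = False
        for name in ranked:
            idx = CHOICES.index(cfg[name])
            if idx + 1 == len(CHOICES):
                continue  # this parameter is already at its largest value
            # 2) Try one step up for the highest-impact parameter.
            trial = dict(cfg, **{name: CHOICES[idx + 1]})
            lat, area = evaluate(trial)
            # 3) Accept only moves that reduce latency and fit the budget.
            if lat < best_lat and area <= AREA_BUDGET:
                cfg, best_lat, improved = trial, lat, True
                break
        if not improved:
            break  # no single-parameter move improves the design
    return cfg, best_lat


if __name__ == "__main__":
    print(bottleneck_guided_search())
```

With these made-up numbers, `bottleneck_guided_search()` returns `({'load': 4, 'compute': 32, 'store': 4}, 64.0)`: it repeatedly parallelizes the dominant `compute` loop, turns to `load` and `store` once they become the bottleneck, and stops when every remaining move would exceed the area budget.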
