Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels

The use of FPGAs in heterogeneous systems is valuable because FPGAs can be used to architect custom hardware that accelerates a particular application or domain. However, they are notoriously difficult to program. High-level synthesis tools such as OpenCL make FPGA development more accessible, but they bring challenges of their own. The synthesized hardware is generated from a description that is semantically closer to the application than to the circuit, which leaves the underlying hardware implementation unclear. Moreover, the hardware tuning knobs exposed by a higher-level specification interact with one another, which makes it harder to find the best-performing hardware configuration. In this work, we address these challenges by describing how to approach the design space, drawing both on information from the literature and on a methodology for better visualizing the hardware that results from the high-level specification. Finally, we present an empirical evaluation of the impact of vectorizing data types as a tunable knob and of its interaction with other coarse-grained hardware knobs.
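To illustrate why the interaction among knobs complicates the search, the following Python sketch enumerates a small, hypothetical design space of three coarse-grained knobs: the vector width of the data type (e.g., `float4`), and the `num_simd_work_items` and `num_compute_units` kernel attributes of the Intel FPGA SDK for OpenCL. The knob values, and the `nominal_lanes` model that simply multiplies them, are assumptions for illustration only, not the evaluation method of this work. Even in this toy space, many distinct configurations share the same nominal parallelism, so that number alone cannot rank them.

```python
from itertools import product

# Hypothetical knob values, for illustration only. Legal values depend on the
# toolchain and the kernel (e.g., with the Intel FPGA SDK for OpenCL,
# num_simd_work_items must evenly divide the required work-group size).
VECTOR_WIDTHS = [1, 2, 4, 8, 16]    # float, float2, ..., float16
SIMD_WORK_ITEMS = [1, 2, 4, 8, 16]  # num_simd_work_items
COMPUTE_UNITS = [1, 2, 4]           # num_compute_units


def nominal_lanes(vec: int, simd: int, cu: int) -> int:
    """Idealized elements processed per clock cycle for one configuration.

    Assumed model: ignores clock frequency, memory bandwidth and resource
    limits, which is why configurations with the same product can still
    perform very differently once synthesized.
    """
    return vec * simd * cu


def design_space():
    """Every (vector width, SIMD width, compute units) combination."""
    return list(product(VECTOR_WIDTHS, SIMD_WORK_ITEMS, COMPUTE_UNITS))


def configs_with_lanes(target: int):
    """Configurations whose nominal parallelism equals `target`."""
    return [c for c in design_space() if nominal_lanes(*c) == target]


if __name__ == "__main__":
    print("design points:", len(design_space()))
    for vec, simd, cu in configs_with_lanes(16):
        print(f"vec={vec:2d} simd={simd:2d} cu={cu}")
```

Running it reports 75 design points, 12 of which share a nominal parallelism of 16 lanes; distinguishing among such configurations is what requires synthesizing and measuring them.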