MKPipe: a compiler framework for optimizing multi-kernel workloads in OpenCL for FPGA

OpenCL for FPGA enables developers to design FPGAs using a programming model similar to that used for processors. Recent works have shown that code optimization at the OpenCL level is important for achieving high computational efficiency. However, existing works either focus primarily on optimizing single kernels or depend solely on channels to design multi-kernel pipelines. In this paper, we propose a source-to-source compiler framework, MKPipe, for optimizing multi-kernel workloads in OpenCL for FPGA. Besides channels, we propose new schemes to enable multi-kernel pipelines. Our optimizing compiler employs a systematic approach to explore the tradeoffs among these optimization methods. To enable more efficient overlap between kernel executions, we also propose a novel workitem/workgroup-id remapping technique. Furthermore, we propose new algorithms for throughput balancing and resource balancing to tune the optimizations applied to individual kernels in the multi-kernel workloads. Our results show that our compiler-optimized multi-kernels achieve up to 3.6x (1.4x on average) speedup over the baseline, in which the kernels have already been optimized individually.
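To make the motivation for overlapping kernel executions and for throughput balancing concrete, the following is a minimal toy model in Python. It is not the MKPipe algorithm: the function names, the fixed per-workgroup latencies `t1` and `t2`, and the assumption that the consumer kernel can start on workgroup i as soon as the producer kernel finishes workgroup i are all illustrative assumptions, not details taken from the paper.

```python
def sequential_time(t1, t2, n):
    """Baseline: kernel 2 starts only after kernel 1 finishes all n workgroups."""
    return n * t1 + n * t2


def pipelined_time(t1, t2, n):
    """Two-stage pipeline: kernel 2 consumes workgroup i once kernel 1 has
    produced it and kernel 2 has finished workgroup i-1."""
    finish1 = 0.0  # time at which kernel 1 finishes its current workgroup
    finish2 = 0.0  # time at which kernel 2 finishes its current workgroup
    for _ in range(n):
        finish1 += t1
        # Kernel 2 waits for its input and for its own previous workgroup.
        finish2 = max(finish1, finish2) + t2
    return finish2


def speedup(t1, t2, n):
    """Speedup of the overlapped schedule over the sequential baseline."""
    return sequential_time(t1, t2, n) / pipelined_time(t1, t2, n)


if __name__ == "__main__":
    # Same total work per workgroup (t1 + t2 = 4), different balance.
    print("balanced   (2, 2):", speedup(2, 2, 4))
    print("unbalanced (1, 3):", speedup(1, 3, 4))
```

In this model the balanced pair overlaps better than the unbalanced pair even though both do the same total work, which is the intuition behind tuning individual kernels so that their throughputs match.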
