Performance-oriented Optimizations for OpenCL Streaming Kernels on the FPGA

Field-programmable gate arrays (FPGAs) can implement streaming applications efficiently, and high-level synthesis (HLS) tools allow people with little hardware design knowledge to evaluate an application on FPGAs. There is therefore a need to understand where OpenCL and FPGAs fit in the streaming domain. To this end, we explore the implementation space and discuss techniques for optimizing the performance of streaming kernels using the Intel OpenCL SDK for FPGA. On the Nallatech 385A FPGA platform, which features an Arria 10 GX1150 FPGA, the experimental results show that FPGA resources, such as block RAMs and DSPs, can limit the performance of a kernel before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can each improve kernel performance by a factor of 2.8 to 10. Combining the two techniques achieves the highest performance, an improvement by a factor of 3.3 to 16. To improve the performance of streaming kernels with compute unit duplication, the local work size needs to be tuned. Compared with an untuned duplicated kernel, the optimal local work size can increase performance by a factor of 3 to 70.
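The local work size tuning mentioned above can start from a list of valid candidates, each of which is then timed on the board. The following is a minimal host-side sketch in C, not the authors' tuning procedure. The helper name `candidate_local_sizes` and its parameters are hypothetical. It assumes a 1-D NDRange, the OpenCL 1.x rule that the local size must divide the global size evenly, and the additional assumption that a vectorized kernel needs a work-group size that is a multiple of its SIMD width.

```c
#include <stddef.h>

/* Hypothetical helper (not from the paper): list local work sizes worth
 * benchmarking for a 1-D NDRange.
 *
 * A candidate is kept when it
 *   - is a multiple of simd_width (assumed vectorization constraint), and
 *   - divides global_size evenly (OpenCL 1.x NDRange requirement), and
 *   - does not exceed max_local or global_size.
 *
 * Writes at most out_cap candidates to out, in ascending order, and
 * returns how many were written. */
size_t candidate_local_sizes(size_t global_size, size_t simd_width,
                             size_t max_local, size_t *out, size_t out_cap)
{
    size_t count = 0;

    if (global_size == 0 || simd_width == 0)
        return 0;

    for (size_t local = simd_width;
         local <= max_local && local <= global_size;
         local += simd_width) {
        if (global_size % local == 0 && count < out_cap)
            out[count++] = local;
    }
    return count;
}
```

Each returned value would then be passed as the local work size when enqueueing the kernel, and the fastest setting kept. Exhaustive timing over a short candidate list is one simple way to find the optimum; the abstract does not say which search strategy the authors used.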
