Evaluating Radial Basis Function Kernel on OpenCL FPGA Platform

Field-programmable gate arrays (FPGAs) are becoming a promising heterogeneous computing component for scientific computing now that floating-point-optimized architectures have been added to current devices. Emerging high-level synthesis (HLS) tools provide a streamlined design flow that makes FPGAs accessible to researchers with little FPGA development experience. In this paper, we use the radial basis function (RBF) kernel of a support vector machine as a case study to evaluate the potential of implementing machine learning kernels on FPGAs and the capability of an HLS tool to convert a kernel written in a high-level language into an FPGA implementation. We explain the HLS flow and the RBF kernel, evaluate the kernel in an OpenCL-to-FPGA HLS flow, and describe the optimizations applied to it. Kernel vectorization and loop unrolling improve kernel performance by a factor of 15.8 over a baseline kernel on the Nallatech 385A FPGA card, which features an Intel Arria 10 GX 1150 FPGA. In terms of energy efficiency, the performance per watt of the FPGA platform is 2.8X higher than that of a 16-core Intel Xeon CPU and 1.7X higher than that of an Nvidia Tesla K80 GPU. On the other hand, the performance per watt of an Intel Xeon Phi Knights Landing CPU and of an Nvidia Tesla P100 GPU is 5.3X and 1.7X higher, respectively, than that of the FPGA.
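For readers unfamiliar with the kernel under study, the following is a minimal sketch of the RBF kernel in its commonly used form, K(x, y) = exp(-gamma * ||x - y||^2). The abstract does not state the formula or show the authors' OpenCL code, so the function names, the `gamma` parameter name, and the unroll factor of 4 are assumptions made for illustration only. The second function mimics by hand what loop unrolling does to the distance loop; in an OpenCL-to-FPGA flow this transformation would normally be requested from the compiler rather than written out.

```python
import math


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    sq_dist = 0.0
    for a, b in zip(x, y):
        d = a - b
        sq_dist += d * d
    return math.exp(-gamma * sq_dist)


def rbf_kernel_unrolled4(x, y, gamma):
    """Same result, with the distance loop manually unrolled by 4.

    Four independent accumulators stand in for the parallel
    multiply-add units that unrolling would create in hardware.
    The unroll factor is a hypothetical choice, not the paper's.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    acc0 = acc1 = acc2 = acc3 = 0.0
    i = 0
    # Main body: process four elements per iteration.
    while i + 4 <= n:
        d0 = x[i] - y[i]
        d1 = x[i + 1] - y[i + 1]
        d2 = x[i + 2] - y[i + 2]
        d3 = x[i + 3] - y[i + 3]
        acc0 += d0 * d0
        acc1 += d1 * d1
        acc2 += d2 * d2
        acc3 += d3 * d3
        i += 4
    sq_dist = (acc0 + acc1) + (acc2 + acc3)
    # Remainder loop for lengths that are not a multiple of 4.
    while i < n:
        d = x[i] - y[i]
        sq_dist += d * d
        i += 1
    return math.exp(-gamma * sq_dist)
```

Both functions should agree up to floating-point rounding, since unrolling only changes the order in which the squared differences are summed.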
