FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs

The recent adoption of OpenCL programming model by FPGA vendors has realized the function portability of OpenCL workloads on FPGA. However, the poor performance portability prevents its wide adoption. To harness the power of FPGAs using OpenCL programming model, it is advantageous to design an analytical performance model to estimate the performance of OpenCL workloads on FPGAs and provide insights into the performance bottlenecks of OpenCL model on FPGA architecture. To this end, this paper presents FlexCL, an analytical performance model for OpenCL workloads on flexible FPGAs. FlexCL estimates the overall performance by tightly coupling the off-chip global memory and on-chip computation models based on the communication mode. Experiments demonstrate that with respect to RTL-based implementation, the average of absolute error of FlexCL is 9.5% and 8.7% for the Rodinia and PolyBench suite, respectively. Moreover, FlexCL enables rapid exploration of the design space within seconds instead of hours or days.

[1]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[2]  Yun Liang,et al.  High-Level Synthesis: Productivity, Performance, and Software Constraints , 2012, J. Electr. Comput. Eng..

[3]  Yun Liang,et al.  Lin-Analyzer: A high-level performance analysis tool for FPGA-based accelerators , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[4]  Jason Helge Anderson,et al.  LegUp: high-level synthesis for FPGA-based processor/accelerator systems , 2011, FPGA '11.

[5]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[6]  Jason Cong,et al.  High-Level Synthesis for FPGAs: From Prototyping to Deployment , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[7]  Kazutoshi Wakabayashi,et al.  Divide and conquer high-level synthesis design space exploration , 2012, TODE.

[8]  Wei Zhang,et al.  A performance analysis framework for optimizing OpenCL applications on FPGAs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[9]  Yun Liang,et al.  A comprehensive framework for synthesizing stencil algorithms on FPGAs using OpenCL model , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[10]  Jason Cong,et al.  An efficient and versatile scheduling algorithm based on SDC formulation , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[11]  B. Ramakrishna Rau,et al.  Iterative modulo scheduling: an algorithm for software pipelining loops , 1994, MICRO 27.

[12]  Gu-Yeon Wei,et al.  Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[13]  Frank Vahid,et al.  Warp Processing: Dynamic Translation of Binaries to FPGA Circuits , 2008, Computer.

[14]  Yu Cao,et al.  Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks , 2016, FPGA.

[15]  Hyojin Choi,et al.  Memory access pattern-aware DRAM performance model for multi-core systems , 2011, (IEEE ISPASS) IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE.

[16]  Luca P. Carloni,et al.  On learning-based methods for design-space exploration with High-Level Synthesis , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[17]  Josep Llosa,et al.  Swing module scheduling: a lifetime-sensitive approach , 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique.

[18]  Zhiru Zhang,et al.  SDC-based modulo scheduling for pipeline synthesis , 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[19]  Lifan Xu,et al.  Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).

[20]  Adrian Park,et al.  Designing Modular Hardware Accelerators in C with ROCCC 2.0 , 2010, 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines.

[21]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[22]  Pedro C. Diniz,et al.  A compiler approach to fast hardware design space exploration in FPGA-based systems , 2002, PLDI '02.

[23]  Amit Kumar Singh,et al.  Exploiting loop-array dependencies to accelerate the design space exploration with high level synthesis , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[24]  Yong Wang,et al.  SDA: Software-defined accelerator for large-scale DNN systems , 2014, 2014 IEEE Hot Chips 26 Symposium (HCS).

[25]  Jean Luc Philippe,et al.  Design Space Pruning Through Early Estimations of Area/Delay Tradeoffs for FPGA Implementations , 2006, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.