Exploring the Efficiency of OpenCL Pipe for Hiding Memory Latency on Cloud FPGAs

OpenCL programming ability combined with OpenCL High-Level Synthesis (OpenCL-HLS) tools have made tremendous improvements in the reconfigurable computing field. FPGAs inherent pipelined parallelism capability provides not only faster execution times but also power-efficient solutions when executing massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to pipelined data-path that hinders the benefits of data-path customization.This paper explores the efficiency of “OpenCL Pipe” to hide memory access latency on cloud FPGAs by decoupling memory access from computation. The Pipe semantic is leveraged to split OpenCL kernels into “read”, “compute” and “write back” sub-kernels which work concurrently to overlap the computation of current threads with the memory access of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite vs. 3.1. All our tests are conducted on the Xilinx VU9FP FPGA platform of Amazon cloud-based AWS EC2 F1 instance. On average, we observe 5.2x speedup with a 2.2x increase in memory bandwidth utilization with about 2.5x increase in FPGA resource utilization over the baseline synthesis (Xilinx OpenCL-HLS).11This work has been funded and supported by the Xilinx University Program (XUP)..

[1]  Peng Zhang,et al.  Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[2]  Pingfan Meng,et al.  Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK , 2014, 2014 International Conference on Field-Programmable Technology (FPT).

[3]  Bingsheng He,et al.  Performance Modeling and Directives Optimization for High-Level Synthesis on FPGA , 2020, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[4]  Jason Cong,et al.  Bandwidth optimization through on-chip memory restructuring for HLS , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[5]  Santosh G. Abraham,et al.  Effective stream-based and execution-based data prefetching , 2004, ICS '04.

[6]  Michael J. Flynn,et al.  Hardware and software cache prefetching techniques for MPEG benchmarks , 2000, IEEE Trans. Circuits Syst. Video Technol..

[7]  David J. Lilja,et al.  Data prefetch mechanisms , 2000, CSUR.

[8]  David R. Kaeli,et al.  Exploring the Efficiency of the OpenCL Pipe Semantic on an FPGA , 2016, SIGARCH Comput. Archit. News.

[9]  Hamed Tabkhi,et al.  Taxonomy of Spatial Parallelism on FPGAs for Massively Parallel Applications , 2018, 2018 31st IEEE International System-on-Chip Conference (SOCC).

[10]  Yun Liang,et al.  Lin-Analyzer: A high-level performance analysis tool for FPGA-based accelerators , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[11]  Doris Chen,et al.  Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms , 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).

[12]  Jason Cong,et al.  Understanding Performance Differences of FPGAs and GPUs: (Abtract Only) , 2018, FPGA.

[13]  Jing Li,et al.  Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network , 2017, FPGA.

[14]  Kenta Kasai,et al.  Flexible non-binary LDPC decoding on FPGAs , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  John Wawrzynek,et al.  Architectural synthesis of computational pipelines with decoupled memory access , 2014, 2014 International Conference on Field-Programmable Technology (FPT).

[16]  Henry M. Levy,et al.  An Architecture for Software-Controlled Data Prefetching , 1991, ISCA.

[17]  Seth H. Pugsley,et al.  Efficiently prefetching complex address patterns , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18]  Wei Zhang,et al.  A performance analysis framework for optimizing OpenCL applications on FPGAs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[19]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[20]  Collin McCurdy,et al.  Diagnosis and optimization of application prefetching performance , 2013, ICS '13.

[21]  Hamed Tabkhi,et al.  Locality Aware Memory Assignment and Tiling , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[22]  Alexander V. Veidenbaum,et al.  Multiple stream tracker: a new hardware stride prefetcher , 2014, Conf. Computing Frontiers.

[23]  Wei Zhang,et al.  FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[24]  Satoshi Matsuoka,et al.  Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[25]  David Bernstein,et al.  Compiler techniques for data prefetching on the PowerPC , 1995, PACT.

[26]  Shankar Balachandran,et al.  Hardware prefetchers for emerging parallel applications , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[27]  Martin Margala,et al.  High level programming of FPGAs for HPC and data centric applications , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[28]  Sean O. Settle High-performance Dynamic Programming on FPGAs with OpenCL , 2013 .

[29]  Peng Zhang Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[30]  Darren J. Kerbyson,et al.  Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[31]  Andrew C. Ling,et al.  An OpenCL(TM) Deep Learning Accelerator on Arria 10 , 2017 .