Optimizing Parallel Reduction on OpenCL FPGA Platform – A Case Study of Frequent Pattern Compression

Field-programmable gate arrays (FPGAs) are becoming a promising heterogeneous computing component in high-performance computing. To make FPGAs more accessible to developers and researchers, high-level synthesis tools are raising the FPGA design abstraction from the register-transfer level to high-level language design flows using OpenCL/C/C++. To date, there have been few studies of parallel reduction using atomic functions in the OpenCL-based design flow on an FPGA. Inspired by the reduction operation in frequent pattern compression, we transform the function into an OpenCL kernel and, as a case study, describe the optimizations of the kernel on an Arria 10-based FPGA platform. We find that automatic kernel vectorization does not improve kernel performance; instead, users can vectorize the kernel manually to achieve a speedup. Overall, our optimizations improve kernel performance by a factor of 11.9 over the baseline kernel. The performance per watt of the kernel on an Intel Arria 10 GX1150 FPGA is 5.3X that of an Intel Xeon 16-core CPU and 0.625X that of an Nvidia K80 GPU.