High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.

[1]  Satoshi Matsuoka,et al.  Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL , 2018, FPGA.

[2]  Haohuan Fu,et al.  Eliminating the memory bottleneck: an FPGA-based solution for 3d reverse time migration , 2011, FPGA '11.

[3]  Naoya Maruyama,et al.  Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .

[4]  Pradeep Dubey,et al.  3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[5]  Satoru Yamamoto,et al.  FPGA-Based Scalable and Power-Efficient Fluid Simulation using Floating-Point DSP Blocks , 2017, IEEE Transactions on Parallel and Distributed Systems.

[6]  Marco D. Santambrogio,et al.  On How to Accelerate Iterative Stencil Loops , 2015, ACM Trans. Archit. Code Optim..

[7]  John Freeman,et al.  From opencl to high-performance hardware on FPGAS , 2012, 22nd International Conference on Field Programmable Logic and Applications (FPL).

[8]  Eduard Ayguadé,et al.  Exploiting memory customization in FPGA for 3D stencil computations , 2009, 2009 International Conference on Field-Programmable Technology.

[9]  Satoshi Matsuoka,et al.  Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Stephen John Turner,et al.  Optimizing and Auto-Tuning Iterative Stencil Loops for GPUs with the In-Plane Method , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[11]  Masanori Hariyama,et al.  OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology , 2017, IEEE Transactions on Parallel and Distributed Systems.

[12]  Alejandro Duran,et al.  YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning , 2016, 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC).

[13]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[14]  Chao Yang,et al.  10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[15]  Christian Plessl,et al.  Flexible FPGA design for FDTD using OpenCL , 2017, 2017 27th International Conference on Field Programmable Logic and Applications (FPL).

[16]  Charles Yount,et al.  Vector Folding: Improving Stencil Performance via Multi-dimensional SIMD-vector Representation , 2015, 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems.

[17]  Hikaru Inoue,et al.  Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[18]  Andrew C. Ling,et al.  An OpenCL(TM) Deep Learning Accelerator on Arria 10 , 2017 .

[19]  Weiguo Liu,et al.  18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[20]  Alejandro Duran,et al.  Effective Use of Large High-Bandwidth Memory Caches in HPC Stencil Computation via Temporal Wave-Front Tiling , 2016, 2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).

[21]  Andrew C. Ling,et al.  An OpenCL™ Deep Learning Accelerator on Arria 10 , 2017, FPGA.