Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Recent developments in High-Level Synthesis tools have encouraged software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work, we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator reaches up to 760 GFLOP/s and 375 GFLOP/s of compute performance for 2D and 3D stencils, respectively, which rivals the performance of a highly optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
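To make the idea of combining spatial and temporal blocking concrete, the following is a minimal software sketch in Python. It is not the accelerator described above, and nothing in it is taken from the paper's implementation. It assumes a 2D 5-point averaging stencil with fixed boundary cells and uses overlapped tiling, one common way to combine the two forms of blocking: each spatial tile is loaded together with a halo whose width equals the number of fused time steps, so the tile can be advanced several steps without touching the rest of the grid. The function names and the `tile` and `t_block` parameters are hypothetical.

```python
def stencil_step(grid):
    """Apply one 5-point averaging step; the outermost rows/columns stay fixed."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            out[y][x] = 0.2 * (grid[y][x] + grid[y - 1][x] + grid[y + 1][x]
                               + grid[y][x - 1] + grid[y][x + 1])
    return out


def run_reference(grid, steps):
    """Unblocked baseline: sweep the whole grid once per time step."""
    for _ in range(steps):
        grid = stencil_step(grid)
    return grid


def run_blocked(grid, steps, tile=4, t_block=2):
    """Spatial blocking (tile x tile) combined with temporal blocking (t_block steps)."""
    rows, cols = len(grid), len(grid[0])
    done = 0
    while done < steps:
        t = min(t_block, steps - done)      # time steps fused in this pass
        out = [row[:] for row in grid]
        for y0 in range(0, rows, tile):
            for x0 in range(0, cols, tile):
                y1, x1 = min(y0 + tile, rows), min(x0 + tile, cols)
                # Load the tile plus a halo of width t, clamped to the grid.
                hy0, hx0 = max(y0 - t, 0), max(x0 - t, 0)
                hy1, hx1 = min(y1 + t, rows), min(x1 + t, cols)
                local = [grid[y][hx0:hx1] for y in range(hy0, hy1)]
                # Advance the local block t steps. Halo cells go stale one
                # layer per step, but the staleness never reaches the tile.
                for _ in range(t):
                    local = stencil_step(local)
                # Write back only the tile itself, discarding the halo.
                for y in range(y0, y1):
                    out[y][x0:x1] = local[y - hy0][x0 - hx0:x1 - hx0]
        grid = out
        done += t
    return grid
```

In this sketch the input can be arbitrarily large relative to the tile, because only one tile and its halo are held in the local buffer at a time. The cost is redundant computation in the halo, which grows with `t_block`, so the tile size and the number of fused time steps trade off against each other; this is the kind of tuning the abstract says the performance model guides.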
