OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology

Stencil computation is widely used in scientific computing, and many accelerators based on multicore CPUs and GPUs have been proposed. Because stencil computation has a low operational intensity, high performance usually requires a large external memory bandwidth. FPGAs have the potential to solve this problem by using their large internal memory efficiently. However, implementing an FPGA architecture successfully requires considerable design, testing, and debugging time. To solve this problem, we propose an FPGA-platform that uses the C-like Open Computing Language (OpenCL). We also propose an optimization methodology to find the optimal architecture for a given application using the proposed FPGA-platform. According to the experimental results, we achieved 119 to 237 Gflop/s of processing power and higher processing speed than conventional GPU and multicore CPU implementations.
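
To make the operational-intensity argument concrete, the following is a minimal sketch of a 2D four-neighbor averaging (Jacobi) stencil together with a rough flop-per-byte estimate. It is an illustration only, not the architecture or kernel proposed in the paper: the function names `jacobi_step` and `operational_intensity`, the choice of stencil, and the 4-byte single-precision value size are all assumptions made for this example.

```python
def jacobi_step(grid):
    """Apply one four-neighbor averaging (Jacobi) iteration to the interior of a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary cells are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # 3 additions + 1 multiplication = 4 flops per updated cell
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def operational_intensity(flops_per_cell, bytes_per_cell):
    """Flops performed per byte moved to or from external memory."""
    return flops_per_cell / bytes_per_cell


FLOPS_PER_CELL = 4    # from the update expression above
BYTES_PER_VALUE = 4   # assumption: single-precision values

# Perfect on-chip reuse: each cell is read once and written once.
best_case = operational_intensity(FLOPS_PER_CELL, 2 * BYTES_PER_VALUE)

# No reuse at all: four neighbor reads plus one write per cell.
worst_case = operational_intensity(FLOPS_PER_CELL, 5 * BYTES_PER_VALUE)

print(best_case, worst_case)
```

Under these assumptions the kernel performs well under one flop per byte of external memory traffic even with perfect reuse, which is why keeping neighboring values in on-chip memory matters so much for this class of computation.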
