Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture

Stencils represent a class of computational patterns where an output grid point depends on a fixed shape of neighboring points in an input grid. Stencil computations are prevalent in scientific applications engaging a significant portion of supercomputing resources. Therefore, it has been always important to optimize stencil programs for the best performance. A rich body of research has focused on optimizing stencil computations on almost all parallel architectures. Stencil applications have regular dependency patterns, inherent pipeline-parallelism, and plenty of data reuse. This makes these applications a perfect match for a coarse-grained reconfigurable spatial architecture (CGRA). A CGRA consists of many simple, small processing elements (PEs) connected with an on-chip network. Each PE can be configured to execute part of a stencil computation and all PEs run in parallel; the network can also be configured so that data loaded can be passed from a PE to a neighbor PE directly and thus reused by many PEs without register spilling and memory traffic. How to efficiently map a stencil computation to a CGRA is the key to performance. In this paper, we show a few unique and generalizable ways of mapping one- and multidimensional stencil computations to a CGRA, fully exploiting the data reuse opportunities and parallelism. Our simulation experiments demonstrate that these mappings are efficient and enable the CGRA to outperform state-of-the-art GPUs.

[1]  W. Marsden I and J , 2012 .

[2]  Yuan Tang,et al.  Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms , 2017, SPAA.

[3]  K. Yee Numerical solution of initial boundary value problems involving maxwell's equations in isotropic media , 1966 .

[4]  Mehdi Baradaran Tahoori,et al.  Energy Efficient Scientific Computing on FPGAs using OpenCL , 2017, FPGA.

[5]  Haibin Kan,et al.  Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance , 2014 .

[6]  Charles L. Byrne,et al.  Applied Iterative Methods , 2007 .

[7]  Masanori Hariyama,et al.  OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology , 2017, IEEE Transactions on Parallel and Distributed Systems.

[8]  Paulius Micikevicius,et al.  3D finite difference computation on GPUs using CUDA , 2009, GPGPU-2.

[9]  Satoshi Matsuoka,et al.  Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL , 2018, FPGA.

[10]  Satoru Yamamoto,et al.  Domain-Specific Language and Compiler for Stencil Computation on FPGA-Based Systolic Computational-Memory Array , 2012, ARC.

[11]  Satoru Yamamoto,et al.  Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth , 2014, IEEE Transactions on Parallel and Distributed Systems.

[12]  Joel S. Emer,et al.  Exploiting spatial architectures for edit distance algorithms , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[13]  Antonia Zhai,et al.  Efficient Spatial Processing Element Control via Triggered Instructions , 2014, IEEE Micro.

[14]  Huiyang Zhou,et al.  Tuning Stencil codes in OpenCL for FPGAs , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[15]  G. Karniadakis,et al.  Spectral/hp Element Methods for Computational Fluid Dynamics , 2005 .

[16]  Naoya Maruyama,et al.  Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .