Paper Information - Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture

Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture

Stencils represent a class of computational patterns in which an output grid point depends on a fixed shape of neighboring points in an input grid. Stencil computations are prevalent in scientific applications that engage a significant portion of supercomputing resources. Therefore, it has always been important to optimize stencil programs for the best performance. A rich body of research has focused on optimizing stencil computations on almost all parallel architectures. Stencil applications have regular dependency patterns, inherent pipeline parallelism, and plenty of data reuse. These properties make them a perfect match for a coarse-grained reconfigurable spatial architecture (CGRA). A CGRA consists of many simple, small processing elements (PEs) connected by an on-chip network. Each PE can be configured to execute part of a stencil computation, and all PEs run in parallel; the network can also be configured so that loaded data is passed directly from a PE to a neighboring PE and thus reused by many PEs without register spilling or memory traffic. How to map a stencil computation efficiently to a CGRA is the key to performance. In this paper, we show a few unique and generalizable ways of mapping one- and multi-dimensional stencil computations to a CGRA, fully exploiting the data reuse opportunities and parallelism. Our simulation experiments demonstrate that these mappings are efficient and enable the CGRA to outperform state-of-the-art GPUs.
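To make the data-reuse idea in the abstract concrete, the following is a minimal Python sketch of a one-dimensional stencil evaluated two ways: directly, and by streaming the input through a short chain of processing elements modelled as a shift register. It is an illustration under stated assumptions, not the mapping proposed in the paper: the function names, the weight vector, and the shift-register model of PE-to-PE forwarding are all hypothetical.

```python
from collections import deque


def stencil_1d_reference(x, weights):
    """Direct evaluation: y[i] = sum_k weights[k] * x[i + k] for each valid i."""
    r = len(weights)
    return [
        sum(w * x[i + k] for k, w in enumerate(weights))
        for i in range(len(x) - r + 1)
    ]


def stencil_1d_streaming(x, weights):
    """Stream x through a chain of len(weights) hypothetical PEs.

    Assumption: PE-to-PE forwarding is modelled as a fixed-length shift
    register, so each input value is loaded exactly once and then handed
    from one PE to the next on every step.
    """
    r = len(weights)
    window = deque(maxlen=r)  # window[k] is the value currently held by PE k
    outputs = []
    loads = 0
    for value in x:
        window.append(value)  # one load; older values shift one PE along
        loads += 1
        if len(window) == r:  # chain is full: one output per step
            outputs.append(sum(w * v for w, v in zip(weights, window)))
    return outputs, loads


if __name__ == "__main__":
    x = [1, 4, 9, 16, 25]
    weights = [1, -2, 1]  # 3-point second-difference stencil
    print(stencil_1d_reference(x, weights))
    print(stencil_1d_streaming(x, weights))
```

In this sketch the direct version reads each interior input up to `len(weights)` times, while the streaming version performs one load per input element, which is the kind of reuse the abstract attributes to passing data between neighboring PEs.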

Hongbo Rong | Carl Ebeling | Fabrizio Petrini | Jesmin Jahan Tithi | Andrei Valentin
