Smart-Cache: Optimising Memory Accesses for Arbitrary Boundaries and Stencils on FPGAs

A key requirement for high performance on FPGAs is to maintain continuous data streaming from DRAM. In many computations, especially in the scientific computing domain, irregular stencils and boundary conditions impede this, because they require memory accesses that are random, redundant, or both. To address this problem, we present Smache, a novel smart-caching framework that uses FPGA on-chip memory resources to optimise access for arbitrary stencil shapes and boundary conditions. We propose a combination of stream and static buffers; it is the static buffers that allow arbitrarily large offsets in stencils. The architecture is complemented by a formal model for determining the buffer configuration. We also propose a hybrid use of the block and distributed RAM on the FPGA. The design is validated for a 4-point stencil on a 2D grid with circular boundaries.
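
To illustrate why circular boundaries lead to large stencil offsets, here is a minimal Python sketch. It is not part of Smache and does not model its buffers. It assumes a row-major 2D grid and a 4-point (north, south, west, east) stencil with wrap-around, and the function names `neighbour_offsets` and `max_reach` are hypothetical. It measures how far, in linear stream positions, each neighbour lies from the centre point.

```python
def neighbour_offsets(row, col, height, width):
    """Linear (row-major) offsets of the four neighbours, with wrap-around."""
    centre = row * width + col
    neighbours = [
        ((row - 1) % height, col),  # north
        ((row + 1) % height, col),  # south
        (row, (col - 1) % width),   # west
        (row, (col + 1) % width),   # east
    ]
    return [r * width + c - centre for r, c in neighbours]


def max_reach(height, width, interior_only):
    """Largest absolute offset over the grid (assumption: height, width >= 3)."""
    reach = 0
    for row in range(height):
        for col in range(width):
            on_edge = row in (0, height - 1) or col in (0, width - 1)
            if interior_only and on_edge:
                continue
            offsets = neighbour_offsets(row, col, height, width)
            reach = max(reach, max(abs(o) for o in offsets))
    return reach


if __name__ == "__main__":
    h, w = 8, 8
    print("interior reach:", max_reach(h, w, interior_only=True))
    print("reach with circular boundaries:", max_reach(h, w, interior_only=False))
```

Under these assumptions, interior points reach at most one row (`width` elements) away, which a window over the stream can cover. Points on the top or bottom edge reach `(height - 1) * width` elements away, an offset that grows with the grid size. This is the kind of access for which the abstract says static buffers are needed.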
