SODA: Stencil with Optimized Dataflow Architecture

Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many of the stencil kernels are complex, usually consist of multiple stages or iterations, and are often computation-bounded. Such kernels are often offloaded to FPGAs to take advantages of the efficiency of dedicated hardware. However, implementing such complex kernels efficiently is not trivial, due to complicated data dependencies, difficulties of programming FPGAs with RTL, as well as large design space. In this paper we present SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs. The SODA microarchitecture minimizes the on-chip reuse buffer size required by full data reuse and provides flexible and scalable fine-grained parallelism. The SODA automation framework takes high-level user input and generates efficient, high-frequency dataflow implementation. This significantly reduces the difficulty of programming FPGAs efficiently for stencil algorithms. The SODA design-space exploration framework models the resource constraints and searches for the performance-optimized configuration with accurate models for post-synthesis resource utilization and on-board execution throughput. Experimental results from on-board execution using a wide range of benchmarks show up to 3.28x speed up over 24-thread CPU and our fully automated framework achieves better performance compared with manually designed state-of-the-art FPGA accelerators.

[1]  Xuan Yang,et al.  Programming Heterogeneous Systems from an Image Processing DSL , 2016, ACM Trans. Archit. Code Optim..

[2]  Eduard Ayguadé,et al.  Exploiting memory customization in FPGA for 3D stencil computations , 2009, 2009 International Conference on Field-Programmable Technology.

[3]  Alan C. Bovik,et al.  The Essential Guide to Image Processing , 2009, J. Electronic Imaging.

[4]  Marco D. Santambrogio,et al.  A polyhedral model-based framework for dataflow implementation on FPGA devices of Iterative Stencil Loops , 2016, 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[5]  Tom Feist,et al.  Vivado Design Suite , 2012 .

[6]  Satoshi Matsuoka,et al.  Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL , 2018, FPGA.

[7]  P. Sadayappan,et al.  High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.

[8]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[9]  Uday Bondhugula,et al.  Effective automatic parallelization of stencil computations , 2007, PLDI '07.

[10]  Jürgen Teich,et al.  Generating FPGA-based image processing accelerators with Hipacc: (Invited paper) , 2017, 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[11]  Jason Cong,et al.  High-Level Synthesis for FPGAs: From Prototyping to Deployment , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[12]  Klaus Sutner,et al.  Computation theory of cellular automata , 1998 .

[13]  Yun Liang,et al.  A comprehensive framework for synthesizing stencil algorithms on FPGAs using OpenCL model , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[14]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[15]  J. Ramanujam,et al.  A framework for enhancing data reuse via associative reordering , 2014, PLDI.

[16]  Jason Cong,et al.  Polyhedral-based data reuse optimization for configurable computing , 2013, FPGA '13.

[17]  Nachiket Kapre,et al.  Energy-Efficient Acceleration of OpenCV Saliency Computation Using Soft Vector Processors , 2015, 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines.

[18]  Satoshi Matsuoka,et al.  Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[19]  Jason Cong,et al.  An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers , 2016, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[20]  Frédo Durand,et al.  Decoupling algorithms from schedules for easy optimization of image processing pipelines , 2012, ACM Trans. Graph..

[21]  Greg Stitt,et al.  Scalable Window Generation for the Intel Broadwell+Arria 10 and High-Bandwidth FPGA Systems , 2018, FPGA.

[22]  Satoru Yamamoto,et al.  Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth , 2014, IEEE Transactions on Parallel and Distributed Systems.

[23]  Shih-Wei Liao,et al.  Locality-Aware Scheduling for Stencil Code in Halide , 2016, 2016 45th International Conference on Parallel Processing Workshops (ICPPW).

[24]  Jason Cong,et al.  Optimizing memory hierarchy allocation with loop transformations for high-level synthesis , 2012, DAC Design Automation Conference 2012.

[25]  Mingjie Lin,et al.  Graph-Theoretically Optimal Memory Banking for Stencil-Based Computing Kernels , 2018, FPGA.

[26]  Gerhard Wellein,et al.  Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory , 2009, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[27]  Paul Feautrier,et al.  Some efficient solutions to the affine scheduling problem. I. One-dimensional time , 1992, International Journal of Parallel Programming.

[28]  Pat Hanrahan,et al.  Darkroom , 2014, ACM Trans. Graph..

[29]  G. Roth,et al.  Compiling Stencils in High Performance Fortran , 1997, ACM/IEEE SC 1997 Conference (SC'97).

[30]  Jason Cong,et al.  An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers , 2014, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[31]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[32]  Jason Cong,et al.  A quantitative analysis on microarchitectures of modern CPU-FPGA platforms , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[33]  Jason Cong,et al.  Latte: Locality Aware Transformation for High-Level Synthesis , 2018, 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[34]  Alan C. Bovik,et al.  The Essential Guide to Video Processing , 2009, J. Electronic Imaging.