StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate the generated architectures on an FPGA testbed, demonstrating the highest single-device and multi-device performance recorded for stencil programs on FPGAs to date, then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice.
