NARMADA: Near-Memory Horizontal Diffusion Accelerator for Scalable Stencil Computations

Real-world weather forecasting applications consist of compound stencil kernels that perform poorly on conventional architectures because of their complex data access patterns, limited data reuse, and low arithmetic intensity. To overcome these issues, we harness near-memory computing by offloading horizontal diffusion, a compound stencil kernel from the COSMO weather prediction application, to a reconfigurable fabric. We use a heterogeneous system that comprises a CPU and an FPGA with on-chip SRAM and on-board DRAM. By introducing a memory hierarchy tailored to the target application and using a coherent memory model, we move the computation close to the memory, which improves memory efficiency. Our FPGA hardware design uses high-level synthesis and yields an accelerator that relies on IBM CAPI 2.0 (Coherent Accelerator Processor Interface) technology. We evaluate it against a tuned software implementation running on an IBM POWER9 host system. The experimental results show that the horizontal diffusion kernel on an FPGA can outperform a complete 16-core POWER9 node (configured with 64 threads) by 3.3x. Moreover, our solution provides an 18x improvement in active energy consumption.
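To make the term "compound stencil" concrete, the following is a minimal Python sketch of a horizontal-diffusion-style kernel on a single 2D plane. It is an illustration under stated assumptions, not the accelerator design or the COSMO source: the function names (`laplacian`, `limited_flux`, `hdiff_sketch`), the sign conventions, the flux limiter, and the boundary handling are all hypothetical simplifications. It chains a five-point Laplacian, a limited flux in each horizontal direction, and a final update, so each output point depends on a wide neighbourhood of inputs while performing only a few arithmetic operations per value loaded.

```python
def laplacian(f, i, j):
    """Five-point stencil at (i, j): 4 * centre minus the four neighbours."""
    return 4.0 * f[i][j] - (f[i - 1][j] + f[i + 1][j] + f[i][j - 1] + f[i][j + 1])


def limited_flux(lap_a, lap_b, f_a, f_b):
    """Difference of two stencil values, zeroed when its sign matches
    the sign of the field difference (an assumed, simplified limiter)."""
    flux = lap_b - lap_a
    return 0.0 if flux * (f_b - f_a) > 0.0 else flux


def hdiff_sketch(f, coeff):
    """Apply the three chained stages to a 2D list of floats.

    Points closer than two cells to the edge are copied unchanged,
    because the chained stencils need a halo of width two.
    """
    n, m = len(f), len(f[0])
    out = [row[:] for row in f]

    # Stage 1: Laplacian wherever all four neighbours exist.
    lap = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            lap[i][j] = laplacian(f, i, j)

    # Stages 2 and 3: limited fluxes on the four cell faces, then the update.
    for i in range(2, n - 2):
        for j in range(2, m - 2):
            flx_hi = limited_flux(lap[i][j], lap[i + 1][j], f[i][j], f[i + 1][j])
            flx_lo = limited_flux(lap[i - 1][j], lap[i][j], f[i - 1][j], f[i][j])
            fly_hi = limited_flux(lap[i][j], lap[i][j + 1], f[i][j], f[i][j + 1])
            fly_lo = limited_flux(lap[i][j - 1], lap[i][j], f[i][j - 1], f[i][j])
            out[i][j] = f[i][j] - coeff * (flx_hi - flx_lo + fly_hi - fly_lo)
    return out
```

Because the second stage consumes the output of the first at shifted positions, an implementation either materialises the intermediate `lap` array, as this sketch does, or recomputes it per point. That dependency is one way to see why such kernels tend to be limited by memory access rather than by arithmetic.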
