CFD code adaptation to the FPGA architecture

Recent years have seen intensive development of accelerated computing platforms. Although current trends indicate a well-established position of GPU devices in the HPC environment, FPGAs (Field-Programmable Gate Arrays) aspire to be an alternative solution for offloading CPU computation. This paper presents a systematic adaptation of four different CFD (Computational Fluid Dynamics) kernels to the Xilinx Alveo U250 FPGA. The goal is to investigate the potential of the FPGA architecture as a future infrastructure capable of running the most complex numerical simulations in fluid flow modeling. The selected kernels are customized to a real scientific scenario and are compatible with the EULAG (Eulerian/semi-Lagrangian) fluid solver. The solver is used to simulate thermo-fluid flows across a wide range of scales and is extensively used in numerical weather prediction. The adaptation focuses on analyzing the strengths and weaknesses of the FPGA accelerator with respect to performance and energy efficiency. It is compared with a heavily optimized CPU implementation to provide realistic and objective benchmarks. The performance results are compared against a set of server CPUs from different Intel generations: the Skylake-based Xeon Gold 6148 and Xeon Platinum 8168, and the Xeon E5-2695 based on the Ivy Bridge architecture. Since all the kernels are memory-bound, our main challenge is to saturate global memory bandwidth and provide data locality through intensive reuse of BRAM (Block RAM). Our adaptation improves performance per watt by up to 80% compared to the CPUs.
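To make the data-locality point concrete, the following is a minimal sketch of on-chip data reuse in a memory-bound stencil. It is not one of the paper's four kernels: the 3-point averaging stencil, the function names `stencil3_naive` and `stencil3_streaming`, and the read counter are all assumptions introduced for illustration, and the code is plain C++ rather than HLS code for the Alveo U250. The three-element window stands in for on-chip storage such as registers or BRAM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3-point averaging stencil (illustrative only).
// Naive version: every interior output re-reads its three neighbours
// from "global memory" (here simply the input vector).
std::vector<float> stencil3_naive(const std::vector<float>& in, std::size_t& reads) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    reads = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || i + 1 == n) {
            out[i] = in[i];  // boundary points are copied unchanged
            reads += 1;
        } else {
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
            reads += 3;
        }
    }
    return out;
}

// Streaming version: each input element is read exactly once and then
// kept in a small sliding window, which plays the role of on-chip memory.
std::vector<float> stencil3_streaming(const std::vector<float>& in, std::size_t& reads) {
    const std::size_t n = in.size();
    std::vector<float> out(n);
    reads = 0;
    if (n < 3) {  // no interior points: copy the input through
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = in[i];
            ++reads;
        }
        return out;
    }
    float w0 = 0.0f;   // oldest value in the window
    float w1 = in[0];  // middle value
    float w2 = in[1];  // newest value
    reads = 2;
    out[0] = w1;  // left boundary, taken from the window
    for (std::size_t i = 2; i < n; ++i) {
        w0 = w1;     // shift the window by one element
        w1 = w2;
        w2 = in[i];  // the only global read in this iteration
        ++reads;
        out[i - 1] = (w0 + w1 + w2) / 3.0f;
    }
    out[n - 1] = w2;     // right boundary, taken from the window
    assert(reads == n);  // invariant: one global read per element
    return out;
}
```

For a five-element input, the naive version performs 11 reads and the streaming version 5, with identical outputs. Multidimensional stencils would need to buffer whole lines or planes rather than three scalars, which is where larger on-chip memories become relevant.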
