High-Performance Spectral Element Methods on Field-Programmable Gate Arrays: Implementation, Evaluation, and Future Projection

Improvements in computer systems have historically relied on two well-known observations: Moore’s law and Dennard scaling. Today, both of these trends are ending, forcing computer users, researchers, and practitioners to abandon the comforts of general-purpose architectures in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a convenient balance between complexity and performance. In this paper, we study the applicability of modern FPGAs to accelerating the Spectral Element Method (SEM), which is core to many computational fluid dynamics (CFD) applications. We design a custom double-precision SEM hardware accelerator, evaluate it empirically on the latest Stratix 10 GX-series FPGAs, and position its performance (and power efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM accelerator, which we use to project the performance of future FPGAs and their role in accelerating CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?
