A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays: Design, Evaluation, and Future Challenges

The impending end of Moore’s law motivates the search for new forms of computing that can continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand. In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work – which often focuses on accelerating small kernels – we target the entire Poisson solver on unstructured meshes based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a technique that reduces data movement by computing geometric factors on the fly, which yields more than 700 Gflop/s of single-precision performance and a reduction in runtime of more than 2x for the local evaluation of the Laplace operator. We end the paper by discussing the challenges and opportunities of using reconfigurable architectures in the future, particularly in the light of emerging (not yet available) technologies.
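To illustrate why an I/O-based performance model favors computing geometric factors on the fly, the following Python sketch estimates a memory-traffic bound for the local Laplace evaluation. It is not the model from the paper: the flop count per grid point (12n + 15 for n points per dimension), the count of eight values moved per point when six geometric factors are stored, the count of two values per point when they are recomputed, and the 75 GB/s bandwidth are all assumptions chosen for illustration. The sketch also ignores the extra arithmetic and per-element geometry data that recomputation itself requires.

```python
def flops_per_point(n):
    """Assumed flop count per grid point for the local Laplace operator
    with n points per dimension: tensor contractions (12 * n) plus
    applying the geometric factors (15)."""
    return 12 * n + 15


def traffic_bound_gflops(n, values_per_point, bytes_per_value, bandwidth_gbs):
    """Upper bound on Gflop/s if the kernel is limited only by
    external-memory traffic (flop/byte times GB/s)."""
    intensity = flops_per_point(n) / (values_per_point * bytes_per_value)
    return intensity * bandwidth_gbs


# Hypothetical parameters, not taken from the paper.
N_POINTS = 8           # points per dimension (polynomial degree 7)
BYTES_SP = 4           # single precision
BANDWIDTH_GBS = 75.0   # placeholder external-memory bandwidth

# Stored factors: read input (1) + six geometric factors (6) + write output (1).
stored = traffic_bound_gflops(N_POINTS, 8, BYTES_SP, BANDWIDTH_GBS)
# On-the-fly factors: read input (1) + write output (1) only.
on_the_fly = traffic_bound_gflops(N_POINTS, 2, BYTES_SP, BANDWIDTH_GBS)

print(f"stored factors:     {stored:.1f} Gflop/s bound")
print(f"on-the-fly factors: {on_the_fly:.1f} Gflop/s bound")
print(f"traffic-bound speedup limit: {on_the_fly / stored:.1f}x")
```

Under these assumptions the traffic bound rises by a factor of four, which is an idealized ceiling; a measured improvement would be lower once the cost of recomputation and the compute limits of the device are taken into account.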
