FPGA-Based Scalable and Power-Efficient Fluid Simulation using Floating-Point DSP Blocks

Large-scale fluid dynamics simulation requires high-performance, low-power computation. CPUs and GPUs, whose architectures are inefficient for this target application, now struggle to improve their power efficiency. FPGAs have become promising alternatives for power-efficient, high-performance computation thanks to new architectures with floating-point (FP) DSP blocks, but their relatively narrow memory bandwidth means that an appropriate design is needed to fully exploit this advantage. This paper presents an architecture and design for scalable fluid simulation based on data-flow computing with a state-of-the-art FPGA. To exploit the available hardware resources, including FP DSPs, we introduce spatial and temporal parallelism so that performance scales further as more stream processing elements (SPEs) are added to an array. Performance modeling and a prototype implementation allow us to explore the design space for both the existing Altera Arria10 and the upcoming Intel Stratix10 FPGAs. We demonstrate that the Arria10 10AX115 FPGA achieves 519 GFlops at 9.67 GFlops/W with a stream bandwidth of only 9.0 GB/s, which is 97.9 percent of the peak performance of the 18 implemented SPEs. We also estimate that a Stratix10 FPGA can scale up to 6844 GFlops by combining spatial and temporal parallelism appropriately.
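
The Arria10 figures quoted above imply a few further quantities that the abstract does not state directly. The following is a minimal sketch of that arithmetic, assuming the reported numbers are mutually consistent; the function name `derived_metrics` and the dictionary keys are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract for the Arria10 10AX115 prototype.
SUSTAINED_GFLOPS = 519.0      # sustained performance
GFLOPS_PER_WATT = 9.67        # power efficiency
STREAM_BANDWIDTH_GBS = 9.0    # stream bandwidth in GB/s
PEAK_FRACTION = 0.979         # sustained / peak of the implemented SPEs
NUM_SPES = 18                 # number of implemented SPEs


def derived_metrics():
    """Derive quantities implied by the abstract (not stated in it)."""
    # Power draw implied by performance and efficiency.
    power_w = SUSTAINED_GFLOPS / GFLOPS_PER_WATT
    # Peak performance of the SPE array implied by the 97.9 percent figure.
    peak_gflops = SUSTAINED_GFLOPS / PEAK_FRACTION
    # Average peak performance contributed by a single SPE.
    peak_per_spe = peak_gflops / NUM_SPES
    # Floating-point operations performed per byte streamed.
    flops_per_byte = SUSTAINED_GFLOPS / STREAM_BANDWIDTH_GBS
    return {
        "power_w": power_w,
        "peak_gflops": peak_gflops,
        "peak_per_spe_gflops": peak_per_spe,
        "flops_per_byte": flops_per_byte,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.2f}")
```

The ratio of sustained performance to stream bandwidth, several tens of floating-point operations per byte, illustrates why a modest stream bandwidth can feed the SPE array at close to its peak rate.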
