A hierarchical parallel implementation for heterogeneous computing. Application to algebra-based CFD simulations on hybrid supercomputers

Abstract: The quest for new portable implementations of simulation algorithms is motivated by the increasing variety of computing architectures. Moreover, the hybridization of high-performance computing systems imposes additional constraints, since heterogeneous computations are needed to efficiently engage both processors and massively parallel accelerators. This, in turn, involves different parallel paradigms and computing frameworks and requires complex data exchanges between computing units. Typically, simulation codes rely on sophisticated data structures and computing subroutines, so-called kernels, which makes porting them cumbersome. Thus, a natural way to achieve portability is to dramatically reduce the complexity of both data structures and computing kernels. In our algebra-based approach, the scale-resolving simulation of incompressible turbulent flows on unstructured meshes relies on three fundamental kernels: the sparse matrix-vector product, the linear combination of vectors, and the dot product. This approach is not limited to a particular kind of numerical method or set of governing equations. In our code, an auto-balanced multilevel partitioning distributes the workload among computing devices of various architectures. Overlapping computations with multistage communications efficiently hides the data-exchange overhead in large-scale supercomputer simulations. In addition to computing on accelerators, special attention is paid to efficiency on manycore processors in multiprocessor nodes with a significant non-uniform memory access (NUMA) factor. Parallel efficiency and performance are studied in detail for different execution modes on various supercomputers using up to 9,600 processor cores and up to 256 graphics processing units. The heterogeneous implementation model described in this work is a general-purpose approach that is well suited for various subroutines in numerical simulation codes.
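To make the three-kernel idea concrete, the following is a minimal single-process sketch in Python, not the authors' implementation. It assumes a CSR (compressed sparse row) storage format for the sparse matrix; the names `CSRMatrix`, `spmv`, `axpby`, `dot`, and `conjugate_gradient` are hypothetical and chosen for illustration only. The conjugate gradient solver is used here simply as an example of an algorithm that can be composed entirely from the sparse matrix-vector product, the linear combination of vectors, and the dot product; the abstract does not state which solvers the code uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CSRMatrix:
    """Sparse matrix in CSR format (an assumed storage choice)."""
    row_ptr: List[int]    # row i owns entries row_ptr[i] .. row_ptr[i+1]-1
    col_idx: List[int]    # column index of each stored entry
    values: List[float]   # value of each stored entry


def spmv(A: CSRMatrix, x: List[float]) -> List[float]:
    """Kernel 1: sparse matrix-vector product y = A x."""
    n_rows = len(A.row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(A.row_ptr[i], A.row_ptr[i + 1]):
            acc += A.values[k] * x[A.col_idx[k]]
        y[i] = acc
    return y


def axpby(a: float, x: List[float], b: float, y: List[float]) -> List[float]:
    """Kernel 2: linear combination of vectors z = a*x + b*y."""
    return [a * xi + b * yi for xi, yi in zip(x, y)]


def dot(x: List[float], y: List[float]) -> float:
    """Kernel 3: dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def conjugate_gradient(A: CSRMatrix, b: List[float],
                       tol: float = 1e-12, max_iter: int = 100) -> List[float]:
    """Solve A x = b for a symmetric positive definite A.

    Illustrative only: every vector operation below is one of the
    three kernels defined above.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual r = b - A x, with x = 0
    p = list(r)            # initial search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        Ap = spmv(A, p)
        alpha = rr / dot(p, Ap)
        x = axpby(1.0, x, alpha, p)
        r = axpby(1.0, r, -alpha, Ap)
        rr_new = dot(r, r)
        p = axpby(1.0, r, rr_new / rr, p)
        rr = rr_new
    return x


# Small usage example: 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]].
A = CSRMatrix(
    row_ptr=[0, 2, 5, 7],
    col_idx=[0, 1, 0, 1, 2, 1, 2],
    values=[2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0],
)
solution = conjugate_gradient(A, [1.0, 0.0, 1.0])
```

Because the solver touches data only through `spmv`, `axpby`, and `dot`, porting it to another device would in principle require reimplementing just those three functions, which is the portability argument the abstract makes. A distributed version would additionally need halo exchanges before `spmv` and a global reduction inside `dot`; those are omitted from this sketch.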
