A General Design for a Scalable MPI-GPU Multi-Resolution 2D Numerical Solver

This article presents a multi-GPU implementation of a finite-volume solver on a multi-resolution grid. The implementation offloads the computation entirely to the GPUs, and communication between GPUs is handled through the Message Passing Interface (MPI) API. Several domain decomposition techniques were considered, and the one based on Hilbert Space-Filling Curves (HSFC) showed optimal scalability. Several optimizations are introduced: one-to-one MPI communications among ranks are completely masked by GPU computations on internal cells, and a novel dynamic load balancing algorithm minimizes the waiting times at global MPI synchronization barriers. This algorithm adapts the computational load of the ranks in response to dynamic changes in the execution time of blocks and in network performance; its ability to converge to a balanced computation has been shown empirically through numerical experiments. Tests exploit up to 64 GPUs and 83M cells and achieve an efficiency of 90 percent in weak scalability and 85 percent in strong scalability. The framework is general, and the results of the article can be ported to a wide range of explicit 2D Partial Differential Equation solvers.
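To illustrate the idea behind an HSFC-based domain decomposition, the following minimal sketch orders the blocks of a square grid along a Hilbert curve and cuts the ordered sequence into contiguous chunks of roughly equal weight, one per MPI rank. It is not the article's implementation: the function names `hilbert_index` and `partition_by_hilbert`, the uniform power-of-two block grid, and the per-block `weight` function are all assumptions made for the example.

```python
def hilbert_index(n, x, y):
    """Map block (x, y) on an n x n grid (n a power of two) to its
    position along a Hilbert curve (standard bitwise construction)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_by_hilbert(blocks, n, n_ranks, weight=lambda b: 1.0):
    """Assign blocks to ranks as contiguous segments of the Hilbert ordering.

    `weight` is a hypothetical per-block cost estimate (assumed positive);
    each block goes to the rank that owns the midpoint of its weight interval.
    """
    ordered = sorted(blocks, key=lambda b: hilbert_index(n, b[0], b[1]))
    total = sum(weight(b) for b in ordered)
    parts = [[] for _ in range(n_ranks)]
    acc = 0.0
    for b in ordered:
        w = weight(b)
        rank = min(int((acc + 0.5 * w) * n_ranks / total), n_ranks - 1)
        parts[rank].append(b)
        acc += w
    return parts


if __name__ == "__main__":
    n = 4
    blocks = [(x, y) for x in range(n) for y in range(n)]
    for rank, part in enumerate(partition_by_hilbert(blocks, n, 4)):
        print(rank, part)
```

Because consecutive blocks along the curve are spatial neighbors, each rank receives a compact region, which tends to keep the boundary shared with other ranks (and hence the data to exchange) small. Supplying measured block execution times as `weight` is one plausible way to connect such a partition to a load balancing step, though the article's own algorithm may differ.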
