Multi-level parallelism for incompressible flow computations on GPU clusters

We investigate multi-level parallelism on GPU clusters with MPI-CUDA and hybrid MPI-OpenMP-CUDA parallel implementations, in which all computations are done on the GPU using CUDA. We explore the efficiency and scalability of incompressible flow computations using up to 256 GPUs on a problem with approximately 17.2 billion cells. Our work addresses some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that uses either MPI or MPI-OpenMP for communication. We present three different strategies to overlap computation with communication, and we systematically assess their impact on parallel performance on two different GPU clusters. Our strong- and weak-scaling results for incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and that a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial performance gains. We also find that our tri-level MPI-OpenMP-CUDA parallel implementation offers no significant performance advantage over the dual-level implementation on GPU clusters with two GPUs per node; however, on clusters with higher GPU counts per node, or with different domain decomposition strategies, a tri-level implementation may exhibit higher efficiency than a dual-level implementation and warrants further investigation.
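To make the overlapping idea concrete, the sketch below shows one common pattern for a dual-level MPI-CUDA halo exchange; it is a minimal illustration, not the authors' actual implementation. It assumes a 1-D domain decomposition along k with one ghost plane per side, pinned host buffers, and a stand-in Jacobi stencil (the kernel, function, and buffer names are hypothetical). Boundary planes are updated first in their own CUDA stream and staged to the host, non-blocking MPI exchanges the halos with neighbor ranks, and a second stream updates the interior cells concurrently.

// Minimal sketch of overlapped halo exchange with non-blocking MPI and two
// CUDA streams. All names are illustrative; pointer swapping of the old and
// new solution arrays between time steps is assumed to happen in the caller.
#include <mpi.h>
#include <cuda_runtime.h>

// Stand-in for the flow solver's stencil update: a 7-point Jacobi sweep
// applied to k-planes k0 .. k1-1 of an nx * ny * nz block (ghosts included).
__global__ void jacobi_planes(double* un, const double* u,
                              int nx, int ny, int k0, int k1)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = k0 + blockIdx.z;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 || k >= k1) return;
    size_t sxy = (size_t)nx * ny;
    size_t c = (size_t)k * sxy + (size_t)j * nx + i;
    un[c] = (u[c-1] + u[c+1] + u[c-nx] + u[c+nx] + u[c-sxy] + u[c+sxy]) / 6.0;
}

// One overlapped step. Host buffers hold two planes each and are assumed to
// be pinned (cudaMallocHost) so the asynchronous copies can truly overlap;
// lo/hi may be MPI_PROC_NULL at the physical domain boundaries.
void overlapped_step(double* d_un, const double* d_u,
                     double* h_send, double* h_recv,
                     int nx, int ny, int nz, int lo, int hi,
                     cudaStream_t s_bdy, cudaStream_t s_int)
{
    int n = nx * ny;                         // doubles per k-plane
    size_t sxy = (size_t)n, plane = sxy * sizeof(double);
    dim3 blk(16, 16, 1);
    dim3 grd((nx + 15) / 16, (ny + 15) / 16, 1);

    // 1. Update the two boundary planes first, in the boundary stream.
    jacobi_planes<<<grd, blk, 0, s_bdy>>>(d_un, d_u, nx, ny, 1, 2);
    jacobi_planes<<<grd, blk, 0, s_bdy>>>(d_un, d_u, nx, ny, nz - 2, nz - 1);

    // 2. Stage the freshly computed halo planes to the host asynchronously.
    cudaMemcpyAsync(h_send, d_un + sxy, plane,
                    cudaMemcpyDeviceToHost, s_bdy);
    cudaMemcpyAsync(h_send + sxy, d_un + sxy * (nz - 2), plane,
                    cudaMemcpyDeviceToHost, s_bdy);

    // 3. Launch the interior update concurrently in the second stream.
    dim3 grd_i((nx + 15) / 16, (ny + 15) / 16, nz - 4);
    jacobi_planes<<<grd_i, blk, 0, s_int>>>(d_un, d_u, nx, ny, 2, nz - 2);

    // 4. Exchange halos with neighbor ranks while the interior computes.
    cudaStreamSynchronize(s_bdy);
    MPI_Request req[4];
    MPI_Irecv(h_recv,       n, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(h_recv + sxy, n, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(h_send,       n, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(h_send + sxy, n, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // 5. Copy the received halos into the ghost planes; join both streams.
    cudaMemcpyAsync(d_un, h_recv, plane, cudaMemcpyHostToDevice, s_bdy);
    cudaMemcpyAsync(d_un + sxy * (nz - 1), h_recv + sxy, plane,
                    cudaMemcpyHostToDevice, s_bdy);
    cudaDeviceSynchronize();
}

The key design point in the sketch is ordering: the small boundary kernels run first so the MPI traffic can start as early as possible, hiding the communication cost behind the much larger interior update. The paper's other overlapping strategies vary how aggressively these phases are interleaved.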
