Large-scale parallelization based on CPU and GPU cluster for cosmological fluid simulations

In this study, we present our parallel implementation for large-scale cosmological simulations of 3D supersonic fluids on CPU and GPU clusters. Our developments are based on an OpenMP-parallelized CPU code named WIGEON. We show that using the GPU as an accelerator yields a speedup of 13 to 31 over the sequential Fortran code, depending on the specific GPU card. Furthermore, our results show that the pure MPI parallelization scales very well up to ten thousand CPU cores. In addition, we introduce a hybrid CPU/GPU parallelization scheme and present a detailed analysis of its speedup and scaling for different numbers of CPU cores and GPU cards (up to 256 GPU cards, a limit set by the available computing resources). The scaling efficiency and high speedup rely on a domain decomposition approach, optimization of the WENO algorithm, and a series of techniques for optimizing the CUDA implementation, especially its memory access pattern. We believe this hybrid MPI+CUDA code is an excellent candidate for computing at the 10-petaflop scale and beyond.
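The abstract names the WENO algorithm as the numerical core but does not give its formulas. As a rough illustration only, the following sketch assumes the classic fifth-order finite-difference WENO reconstruction of Jiang and Shu, applied to a single five-point stencil; the function name `weno5_reconstruct` and the value of `eps` are hypothetical choices for this example and are not taken from WIGEON.

```python
def weno5_reconstruct(vm2, vm1, v0, vp1, vp2, eps=1e-6):
    """Left-biased fifth-order WENO value at the interface i+1/2.

    Inputs are the values at points i-2, i-1, i, i+1, i+2.
    Assumption: classic Jiang-Shu weights, not necessarily WIGEON's variant.
    """
    # Third-order candidate reconstructions on the three sub-stencils
    p0 = (2.0 * vm2 - 7.0 * vm1 + 11.0 * v0) / 6.0
    p1 = (-vm1 + 5.0 * v0 + 2.0 * vp1) / 6.0
    p2 = (2.0 * v0 + 5.0 * vp1 - vp2) / 6.0

    # Smoothness indicators: large where a sub-stencil crosses a discontinuity
    b0 = 13.0 / 12.0 * (vm2 - 2.0 * vm1 + v0) ** 2 + 0.25 * (vm2 - 4.0 * vm1 + 3.0 * v0) ** 2
    b1 = 13.0 / 12.0 * (vm1 - 2.0 * v0 + vp1) ** 2 + 0.25 * (vm1 - vp1) ** 2
    b2 = 13.0 / 12.0 * (v0 - 2.0 * vp1 + vp2) ** 2 + 0.25 * (3.0 * v0 - 4.0 * vp1 + vp2) ** 2

    # Nonlinear weights built from the ideal linear weights (0.1, 0.6, 0.3)
    a0 = 0.1 / (eps + b0) ** 2
    a1 = 0.6 / (eps + b1) ** 2
    a2 = 0.3 / (eps + b2) ** 2

    # Convex combination of the three candidates
    return (a0 * p0 + a1 * p1 + a2 * p2) / (a0 + a1 + a2)


if __name__ == "__main__":
    # Smooth (linear) data: the interface value is reproduced exactly
    print(weno5_reconstruct(0.0, 1.0, 2.0, 3.0, 4.0))
    # Step between points i and i+1: the weights favour the smooth left stencil
    print(weno5_reconstruct(1.0, 1.0, 1.0, 0.0, 0.0))
```

Each interface value depends only on a small fixed stencil, so in a domain-decomposed run every subdomain would need a few layers of ghost points from its neighbours before this reconstruction can be applied near its boundary. The ghost-layer exchange itself is not shown here.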
