GPU acceleration of extreme scale pseudo-spectral simulations of turbulence using asynchronism

This paper presents new advances in GPU-driven Fourier pseudo-spectral numerical algorithms that enable simulations of turbulent fluid flow at problem sizes beyond the current state of the art. In contrast to earlier massively parallel petascale systems, the dense nodes of Summit, Sierra, and anticipated exascale machines can be exploited with coarser MPI decompositions, which improve MPI all-to-all scaling. An asynchronous batching strategy, combined with the fast hardware connection between the large CPU memory and the GPUs, allows effective use of the GPUs on problem sizes too large to reside in GPU memory. Communication performance is further improved by a hybrid MPI+OpenMP approach. Favorable performance is obtained up to an 18432³ problem size on 3072 nodes of Summit, with a GPU-to-CPU speedup of 4.7 for a 12288³ problem size (the largest previously published in the turbulence literature).
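The batching strategy described above can be illustrated with a minimal sketch: a transform is applied to a 3-D field in slab-shaped batches, each small enough to fit in device memory, so that host-to-device copies, compute, and device-to-host copies can be pipelined on separate streams. The names below (`batched_fft_x`, `n_batches`) are illustrative assumptions, not the paper's actual implementation, and NumPy stands in for the GPU FFT library.

```python
# Hypothetical sketch of out-of-core batched FFTs over a 3-D field.
# In the real GPU code, each batch would be copied host->device,
# transformed, and copied back asynchronously on its own stream so
# that transfers overlap with compute; here NumPy models the math.
import numpy as np

def batched_fft_x(field, n_batches):
    """Apply an FFT along x to a 3-D array in z-slab batches,
    mimicking staging slices through limited GPU memory."""
    nz = field.shape[2]
    out = np.empty(field.shape, dtype=np.complex128)
    # Split the z-extent into batches sized to fit on the device.
    for zs in np.array_split(np.arange(nz), n_batches):
        out[:, :, zs] = np.fft.fft(field[:, :, zs], axis=0)
    return out

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 8, 12))
# The batched result matches a single whole-array transform.
assert np.allclose(batched_fft_x(f, 4), np.fft.fft(f, axis=0))
```

Because each batch is independent, the loop body maps naturally onto asynchronous copy/compute overlap: while batch k is being transformed on the GPU, batch k+1 can be staged from host memory.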
