Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

Abstract: This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on GPU hardware using single precision. The simulations use a vortex particle method to solve the Navier–Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096^3 computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as the numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the FMM-based vortex method achieving 74% parallel efficiency on 4096 processes (one GPU per MPI process, 3 GPUs per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of MPI processes (using only CPU cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 s for the vortex method and 154 s for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex-method calculations to date.
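The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical and not taken from the paper's code, the efficiency definition (baseline time divided by scaled time at fixed work per process) is the usual convention and is assumed here rather than stated in the abstract, and the timings passed to the efficiency function are made-up values used to show how it is called.

```python
# Back-of-envelope checks on the figures quoted in the abstract.
# All names are illustrative; none come from the paper's code.

def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Assumed convention: time at the baseline process count divided by
    time at the larger count, with the work per process held fixed."""
    return t_base / t_scaled

# Problem size: a cube of 4096^3 points, one particle per point.
n_side = 4096
n_procs = 4096                        # one GPU per MPI process
n_particles = n_side ** 3             # total particle count
per_process = n_particles // n_procs  # work per process

# Time per step reported for the two methods, in seconds.
t_vortex = 108.0
t_spectral = 154.0
time_ratio = t_spectral / t_vortex    # spectral time relative to vortex time

print(f"total particles:       {n_particles:,}")
print(f"particles per process: {per_process:,}")
print(f"spectral / vortex time per step: {time_ratio:.2f}")

# Hypothetical timings (not from the paper), only to show the call.
print(f"example efficiency: {weak_scaling_efficiency(10.0, 12.5):.0%}")
```

The total of 4096^3 points is about 68.7 billion, which agrees with the "69 billion particles" quoted above, and dividing by 4096 processes gives 4096^2 particles per process.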
