Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster

GPGPU has drawn much attention for accelerating non-graphics applications. A simulation using the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster using CUDA programming and the MPI library. The GPU code runs on TSUBAME, the multi-node GPU cluster of the Tokyo Institute of Technology, which is equipped with a total of 680 NVIDIA Tesla GPUs. For multi-GPU computation, domain partitioning is used to distribute the computational load across multiple GPUs, and GPU-to-GPU data transfer becomes a severe overhead on total performance. The parallel results for 1D, 2D and 3D domain partitioning were compared and analyzed. With a 384x384x384 mesh and 96 GPUs, the performance with 3D partitioning is about 3-4 times higher than with 1D partitioning. The performance curve deviates from the ideal line because of the long communication time between GPUs. To hide the communication time, we introduced a technique that overlaps computation and communication, in which data transfer and computation are carried out simultaneously in two streams. Using 8-96 GPUs, performance increases by a factor of about 1.1-1.3 in the overlapping mode. As a benchmark problem, a large-scale computation of flow around a sphere at Re=13,000 was carried out successfully using a 2000x1000x1000 mesh and 100 GPUs. For this computation, with 2 giga (2x10^9) lattice nodes, processing 100,000 time steps took 6.0 h. Under this condition, the computation time (2.79 h) and the data communication time (3.06 h) are almost the same.
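To illustrate why the partitioning dimension matters for GPU-to-GPU transfer, the following minimal sketch counts the face-halo cells that one interior subdomain would exchange per time step under 1D, 2D and 3D partitioning of the 384x384x384 mesh on 96 GPUs. The function name `halo_cells_per_gpu` and the process grids (1, 1, 96), (1, 8, 12) and (4, 4, 6) are assumptions chosen for illustration; the abstract does not state the actual process grids. The count is a simplification that ignores edge exchanges and the fact that only some of the 19 distribution functions cross each face, so it is a rough proxy for communication volume, not a performance model.

```python
def halo_cells_per_gpu(mesh, parts):
    """Count face-halo cells exchanged per step by one interior subdomain.

    mesh  -- global mesh size (nx, ny, nz)
    parts -- number of subdomains along each axis (hypothetical process grid)
    Assumes the mesh divides evenly and a halo one cell thick on each face.
    """
    for m, p in zip(mesh, parts):
        if m % p != 0:
            raise ValueError("mesh must divide evenly by the process grid")
    local = [m // p for m, p in zip(mesh, parts)]

    total = 0
    for axis in range(3):
        if parts[axis] == 1:
            continue  # no neighbor along an unpartitioned axis
        face = 1
        for other in range(3):
            if other != axis:
                face *= local[other]
        total += 2 * face  # two faces (low and high side) per partitioned axis
    return total


if __name__ == "__main__":
    mesh = (384, 384, 384)
    # Illustrative process grids for 96 GPUs (assumed, not from the paper).
    grids = {"1D": (1, 1, 96), "2D": (1, 8, 12), "3D": (4, 4, 6)}
    for name, parts in grids.items():
        print(name, parts, halo_cells_per_gpu(mesh, parts))
```

Under these assumed grids the counts are 294,912 cells for 1D, 61,440 for 2D and 43,008 for 3D. The reduction in halo size is consistent in direction with the reported advantage of 3D partitioning, although the measured speedup also depends on message counts, memory layout and transfer bandwidth, which this sketch does not model.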
