Scalable lattice Boltzmann solvers for CUDA GPU clusters

The lattice Boltzmann method (LBM) is an innovative and promising approach in computational fluid dynamics. From an algorithmic standpoint, it reduces to a regular data-parallel procedure and is therefore well suited to high-performance computing. Numerous works report efficient implementations of the LBM on GPUs, but very few mention multi-GPU versions, and even fewer describe GPU cluster implementations. Yet, to be of practical interest, GPU LBM solvers need to be able to perform large-scale simulations. In the present contribution, we describe an efficient LBM implementation for CUDA GPU clusters. Our solver consists of a set of MPI communication routines and a CUDA kernel specifically designed to handle three-dimensional partitioning of the computation domain. Performance measurements were carried out on a small cluster. We show that the results are satisfactory in terms of both data throughput and parallelisation efficiency.
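
To make the idea of three-dimensional partitioning more concrete, the following Python sketch maps a process rank to a position in a process grid, lists its face neighbours, and counts the nodes on each sub-domain face. It is an illustration only and not the solver described here: the function names, the x-fastest rank ordering, and the non-periodic boundaries are all assumptions, and a real LBM exchange may also involve edge neighbours, which are omitted.

```python
from typing import Dict, Optional, Tuple

Triple = Tuple[int, int, int]


def rank_to_coords(rank: int, dims: Triple) -> Triple:
    """Map a linear rank to (x, y, z) in the process grid (x varies fastest; assumed ordering)."""
    px, py, _ = dims
    return (rank % px, (rank // px) % py, rank // (px * py))


def coords_to_rank(coords: Triple, dims: Triple) -> int:
    """Inverse of rank_to_coords."""
    x, y, z = coords
    px, py, _ = dims
    return x + px * (y + py * z)


def face_neighbours(rank: int, dims: Triple) -> Dict[str, Optional[int]]:
    """Return the rank across each of the six faces, or None at a domain boundary.

    Boundaries are assumed to be non-periodic (hypothetical choice).
    """
    coords = rank_to_coords(rank, dims)
    neighbours: Dict[str, Optional[int]] = {}
    for axis, name in enumerate("xyz"):
        for step, sign in ((-1, "-"), (1, "+")):
            shifted = list(coords)
            shifted[axis] += step
            inside = 0 <= shifted[axis] < dims[axis]
            neighbours[sign + name] = (
                coords_to_rank((shifted[0], shifted[1], shifted[2]), dims)
                if inside
                else None
            )
    return neighbours


def face_node_counts(subdomain: Triple) -> Dict[str, int]:
    """Number of lattice nodes on a face normal to each axis of an nx*ny*nz sub-domain."""
    nx, ny, nz = subdomain
    return {"x": ny * nz, "y": nx * nz, "z": nx * ny}


if __name__ == "__main__":
    dims = (2, 2, 2)  # eight processes, e.g. one per GPU (illustrative)
    print(rank_to_coords(7, dims))
    print(face_neighbours(0, dims))
    print(face_node_counts((4, 8, 16)))
```

The face node counts indicate how the volume of data exchanged with each neighbour depends on the shape of the sub-domain, which is one reason the choice of partitioning affects communication cost.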
