Performance Optimization of 3D Lattice Boltzmann Flow Solver on a GPU

The Lattice Boltzmann Method (LBM) is a powerful numerical method for simulating fluid flow. Because it is inherently data parallel, it is a promising candidate for parallel implementation on a GPU. The LBM, however, is heavily data intensive and memory bound. In particular, moving data to adjacent cells in the streaming phase incurs many uncoalesced memory accesses on the GPU, which degrades overall performance. Furthermore, the main computation kernels of the LBM use a large number of registers per thread, and because the number of registers on the GPU is fixed, this limits the thread parallelism available at run time. In this paper, we develop a high-performance parallelization of the LBM on a GPU that minimizes the overhead of uncoalesced memory accesses and improves cache locality by combining tiling with a change of data layout. We also aggressively reduce the register usage of the LBM kernels in order to increase run-time thread parallelism. Experimental results on the Nvidia Tesla K20 GPU show that our approach achieves a throughput of 1210.63 Million Lattice Updates Per Second (MLUPS).
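To illustrate why the data layout matters for coalescing, the following is a minimal sketch, not the layout used in the paper. It assumes a D3Q19 lattice (19 distribution values per cell), one GPU thread per cell, and two common layouts: array-of-structures (AoS) and structure-of-arrays (SoA). The function names are hypothetical. The sketch compares the distance in memory between the values that two neighboring threads read for the same lattice direction.

```python
Q = 19  # assumption: D3Q19 lattice, 19 distribution values per cell


def aos_index(cell, q, n_cells, n_dirs=Q):
    """Array-of-structures: all directions of one cell are stored together."""
    return cell * n_dirs + q


def soa_index(cell, q, n_cells, n_dirs=Q):
    """Structure-of-arrays: all cells of one direction are stored together."""
    return q * n_cells + cell


def neighbor_thread_stride(index_fn, n_cells, q=0):
    """Distance (in elements) between the addresses touched by thread 0 and
    thread 1 when both read direction q of their own cell."""
    return index_fn(1, q, n_cells) - index_fn(0, q, n_cells)


n_cells = 1000
print(neighbor_thread_stride(aos_index, n_cells))  # stride of Q elements
print(neighbor_thread_stride(soa_index, n_cells))  # stride of 1 element
```

Under these assumptions, consecutive threads touch consecutive elements only in the SoA layout, which is the access pattern a GPU can coalesce into fewer memory transactions. In the AoS layout the same reads are spread Q elements apart.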
