Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems

In this report, we discuss the performance behaviour of different parallel lattice Boltzmann implementations. In previous work, we proposed a fast serial implementation and a cache oblivious spatial and temporal blocking algorithm for the lattice Boltzmann method (LBM) in three spatial dimensions. The cache oblivious update scheme was originally proposed by Frigo et al. Its main idea is to maximise the performance of stencil-based methods by dividing the space-time domain in an optimal way, independently of any external parameters such as cache size. In view of the widening gap between processor speed and memory performance, this approach offers a promising path to better cache utilisation. We present results for the shared memory parallelisation of the cache oblivious implementation, which is based on task queueing, and compare it with the standard iterative implementation, focusing on issues specific to multi-core and multi-socket systems.
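To make the space-time decomposition concrete, the following is a minimal sketch of the recursive trapezoid scheme described by Frigo and Strumpen, reduced to one spatial dimension and a three-point averaging stencil. It is not the three-dimensional LBM implementation discussed in this report. The function names (`kernel`, `walk`, `run_cache_oblivious`, `run_iterative`), the stencil weights and the fixed boundary cells are assumptions chosen for illustration only. Note that no cache size appears anywhere in the recursion: a trapezoid is cut in space when it is wide enough and in time otherwise.

```python
def kernel(u, t, x):
    """Update cell x from time level t to t + 1 using two alternating buffers."""
    src, dst = u[t % 2], u[(t + 1) % 2]
    # Illustrative 3-point stencil (assumed weights, not an LBM collision step).
    dst[x] = 0.25 * src[x - 1] + 0.5 * src[x] + 0.25 * src[x + 1]


def walk(u, t0, t1, x0, dx0, x1, dx1):
    """Process the space-time trapezoid with time range [t0, t1).

    At time t the trapezoid covers cells from x0 + dx0 * (t - t0)
    up to, but not including, x1 + dx1 * (t - t0).
    """
    dt = t1 - t0
    if dt == 1:
        for x in range(x0, x1):
            kernel(u, t0, x)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # Wide trapezoid: space cut along a line of slope -1.
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk(u, t0, t1, x0, dx0, xm, -1)   # left part has no pending dependencies
            walk(u, t0, t1, xm, -1, x1, dx1)   # right part reads results of the left part
        else:
            # Tall trapezoid: time cut, lower half first.
            s = dt // 2
            walk(u, t0, t0 + s, x0, dx0, x1, dx1)
            walk(u, t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)


def run_cache_oblivious(initial, steps):
    """Advance `initial` by `steps` time steps; cells 0 and len - 1 stay fixed."""
    u = [list(initial), list(initial)]
    walk(u, 0, steps, 1, 0, len(initial) - 1, 0)
    return u[steps % 2]


def run_iterative(initial, steps):
    """Reference: plain time loop over all interior cells."""
    u = [list(initial), list(initial)]
    for t in range(steps):
        for x in range(1, len(initial) - 1):
            kernel(u, t, x)
    return u[steps % 2]
```

Both drivers apply the same kernel to the same inputs for every cell and time step, so they are expected to produce identical results; only the traversal order, and hence the data reuse in cache, differs.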

[1] Markus Kowarschik, et al. Data locality optimizations for iterative numerical algorithms and cellular automata on hierarchical memory architectures, 2004, Advances in simulation.

[2] Chau-Wen Tseng, et al. Tiling Optimizations for 3D Scientific Computations, 2000, ACM/IEEE SC 2000 Conference (SC'00).

[3] Gerhard Wellein, et al. Optimizing performance on modern HPC systems: learning from simple kernel benchmarks, 2006.

[4] Harihar Rajaram, et al. Accuracy and Computational Efficiency in 3D Dispersion via Lattice-Boltzmann: Models for Dispersion in Rough Fractures and Double-Diffusive Fingering, 1998.

[5] Ulrich Rüde, et al. Optimization and Profiling of the Cache Performance of Parallel Lattice Boltzmann Codes in 2D and 3D, 2003.

[6] Ernst Rank, et al. Parallelization Strategies and Efficiency of CFD Computations in Complex Geometries Using Lattice Boltzmann Methods on High-Performance Computers, 2002.

[7] Gerhard Wellein, et al. On the single processor performance of simple lattice Boltzmann kernels, 2006.

[8] Jacques Periaux, et al. Parallel Computational Fluid Dynamics 2005: Theory and Applications, 2006.

[9] Volker Strumpen, et al. Cache oblivious stencil computations, 2005, ICS '05.

[10] Gerhard Wellein, et al. Towards Optimal Performance for Lattice Boltzmann Applications on Terascale Computers, 2006.

[11] G. Wellein, et al. Introducing a parallel cache oblivious blocking approach for the lattice Boltzmann method, 2008.

[12] Charles E. Leiserson, et al. Cache-Oblivious Algorithms, 2003, CIAC.

[13] J. Boon. The Lattice Boltzmann Equation for Fluid Dynamics and Beyond, 2003.