Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU-GPU clusters

Abstract Computational fluid dynamics simulations are in general very compute-intensive. Only parallel simulations on modern supercomputers can satisfy the computational demands of complex simulation tasks. GPUs are well suited to these demands because they provide high floating-point performance and high bandwidth between memory and the processor chip. To successfully utilize GPU clusters for the daily business of a large community, usable software frameworks must be established on these clusters. The development of such software frameworks is only feasible with maintainable software designs that consider performance as a design objective right from the start. In this work we extend our software design concepts to achieve more efficient and highly scalable multi-GPU parallelization within waLBerla, our software framework for multi-physics simulations centered around the lattice Boltzmann method. Our software designs now also support a pure-MPI and a hybrid parallelization approach capable of heterogeneous simulations using CPUs and GPUs in parallel. For the first time, weak and strong scaling performance results obtained with waLBerla on the Tsubame 2.0 cluster for more than 1000 GPUs are presented. With the help of a new communication model, the parallel efficiency of our implementation is investigated in a detailed and structured performance analysis. The suitability of the waLBerla framework for production runs on large GPU clusters is demonstrated. As one possible application, we show results of strong scaling experiments for flows through a porous medium.
