An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level

The lattice-Boltzmann method (LBM) is an algorithm for CFD simulations that has gained popularity due to its ease of implementation and its suitability for complex geometries. Its scalability on multicore chips is often limited by its low computational intensity, which leads to interesting characteristics regarding optimal performance and energy to solution on the chip and highly parallel levels. In this paper we perform a thorough analysis of a two-relaxation-time (TRT) model in a sparse lattice representation on the Intel Sandy Bridge processor. Starting from a single-core performance model, we describe the intra-chip saturation characteristics of the implementation and its optimal operating point in terms of energy to solution as a function of the propagation method, the clock frequency, and the SIMD vectorization. We then show whether and how these findings can be extrapolated to the massively parallel level on a petascale-class machine, and we quantify the energy-saving potential of various optimizations.
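The interplay between intra-chip saturation and energy to solution can be illustrated with a minimal sketch. It is not the model developed in the paper: it assumes that performance grows linearly with the core count until memory bandwidth saturates, and that chip power is a constant baseline plus a fixed contribution per active core. All function names and numerical values (work in million lattice-site updates, performance in million updates per second, power in watts) are hypothetical and chosen only for illustration.

```python
def performance(n_cores, p_single, p_saturated):
    """Assumed scaling: linear in the core count, capped at the
    bandwidth-saturated performance of the chip."""
    return min(n_cores * p_single, p_saturated)


def energy_to_solution(n_cores, work, p_single, p_saturated, w_base, w_core):
    """Energy = power * runtime, with an assumed power model of
    baseline power plus a fixed amount per active core."""
    runtime = work / performance(n_cores, p_single, p_saturated)
    power = w_base + n_cores * w_core
    return power * runtime


def best_core_count(max_cores, **model):
    """Core count in 1..max_cores that minimizes energy to solution."""
    return min(range(1, max_cores + 1),
               key=lambda n: energy_to_solution(n, **model))


# Hypothetical parameters, not measurements from the paper.
model = dict(work=1000.0, p_single=20.0, p_saturated=70.0,
             w_base=25.0, w_core=8.0)

for n in range(1, 9):
    print(n, round(energy_to_solution(n, **model), 1))
print("lowest energy at", best_core_count(8, **model), "cores")
```

Under these assumptions, adding cores beyond the saturation point leaves the runtime unchanged while power keeps rising, so the energy minimum sits near the core count at which the memory bandwidth is exhausted rather than at the full chip.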
