Chip‐level and multi‐node analysis of energy‐optimized lattice Boltzmann CFD simulations

Memory‐bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice Boltzmann method on an Intel Sandy Bridge cluster as a prototype scenario to investigate whether and how single‐chip performance and power characteristics can be generalized to the highly parallel case. First, we analyze a sparse‐lattice implementation of the lattice Boltzmann method for complex geometries. Using a single‐core performance model, we predict the intra‐chip saturation characteristics and the optimal operating point in terms of energy‐to‐solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single‐core performance and a correct choice of the number of active cores per chip are the essential optimizations for achieving the lowest energy‐to‐solution at minimal performance degradation. We then extrapolate to the Message Passing Interface (MPI)‐parallel level and quantify the energy‐saving potential of various optimizations and execution modes; there, these guidelines turn out to be even more important, especially when communication overhead is non‐negligible. In our setup, we achieved energy savings of 35% in this case compared with a naive approach. We also demonstrate that simply lowering the clock speed, without further consideration, leaves most of the energy‐saving potential unused. Copyright © 2015 John Wiley & Sons, Ltd.
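The search for an energy-optimal operating point described in the abstract can be illustrated with a minimal sketch. It is not the model used in the paper. It assumes that chip performance scales linearly with the number of active cores and the clock frequency until it hits a memory-bandwidth ceiling, and that chip power is a baseline term plus a per-core term that is quadratic in frequency. All function names and numerical parameters (`P0_PER_GHZ`, `P_SAT`, `W_BASE`, `W1`, `W2`, the frequency list) are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: all parameters below are made-up assumptions,
# not measurements or model constants from the paper.

P0_PER_GHZ = 10.0   # assumed single-core performance per GHz (MLUP/s per GHz)
P_SAT = 80.0        # assumed bandwidth-limited chip performance (MLUP/s)
W_BASE = 25.0       # assumed baseline chip power (W)
W1 = 1.5            # assumed per-core power coefficient, linear in f (W/GHz)
W2 = 1.0            # assumed per-core power coefficient, quadratic in f (W/GHz^2)

FREQUENCIES_GHZ = [1.2, 1.6, 2.0, 2.4, 2.7]  # assumed available clock speeds
MAX_CORES = 8                                # assumed cores per chip


def performance(n_cores, f_ghz):
    """Chip performance: linear in cores and frequency until saturation."""
    return min(n_cores * P0_PER_GHZ * f_ghz, P_SAT)


def power(n_cores, f_ghz):
    """Chip power: baseline plus a frequency-dependent term per active core."""
    return W_BASE + n_cores * (W1 * f_ghz + W2 * f_ghz * f_ghz)


def energy_to_solution(n_cores, f_ghz, work=1.0):
    """Energy for a fixed amount of work: power times runtime."""
    return work * power(n_cores, f_ghz) / performance(n_cores, f_ghz)


def optimal_operating_point():
    """Scan all (cores, frequency) pairs; return (energy, cores, frequency)."""
    candidates = [
        (energy_to_solution(n, f), n, f)
        for f in FREQUENCIES_GHZ
        for n in range(1, MAX_CORES + 1)
    ]
    return min(candidates)


if __name__ == "__main__":
    best_energy, best_n, best_f = optimal_operating_point()
    naive = energy_to_solution(MAX_CORES, max(FREQUENCIES_GHZ))
    print(f"best: {best_n} cores at {best_f} GHz, energy {best_energy:.3f}")
    print(f"naive (all cores, top clock): energy {naive:.3f}")
```

With these assumed parameters, the scan selects a point that just reaches the saturated performance with fewer than all cores at a reduced clock, which is the qualitative behavior the abstract describes. The numerical savings of this toy model have no relation to the 35% reported for the actual measurements.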
