Two-level parallelization of a fluid mechanics algorithm exploiting hardware heterogeneity

Abstract: The prospect of wildly heterogeneous computer systems has led to a renewed discussion of programming approaches in high-performance computing, of which computational fluid dynamics is a major field. The challenge lies in harvesting the performance of all available hardware components while retaining good programmability. In particular, the use of graphics cards is an important trend. The present paper addresses this challenge by devising a hybrid programming model that creates a heterogeneous data-parallel computation from a single source code. The concept is demonstrated for a one-dimensional spectral-element discretization of a fluid dynamics problem. To exploit the additional hardware available when coupling GPGPU-accelerated processes with excess CPU cores, a straightforward load balancing model for such heterogeneous environments is developed. The paper presents a large number of runtime measurements and demonstrates that the achieved performance gains are close to optimal. This provides valuable information for the implementation of fluid dynamics codes on modern heterogeneous hardware.
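The abstract does not spell out the load balancing model, so the following is only a minimal sketch of one common static approach, not the authors' method. It assumes each process (GPU-accelerated or CPU-only) has a measured throughput in elements per second, and it assigns spectral elements in proportion to that throughput. The function names `partition_elements` and `predicted_step_time` and the throughput figures are hypothetical.

```python
def partition_elements(n_elements, throughputs):
    """Split n_elements among workers in proportion to their throughput.

    Assumption: throughputs[i] is a measured rate (elements per second)
    for worker i. This is an illustrative model, not the paper's.
    """
    if n_elements < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need n_elements >= 0 and positive throughputs")

    total = sum(throughputs)
    # Ideal (fractional) share of each worker.
    ideal = [n_elements * t / total for t in throughputs]
    # Round down, then hand the leftover elements to the workers
    # with the largest fractional remainders.
    counts = [int(x) for x in ideal]
    leftover = n_elements - sum(counts)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def predicted_step_time(counts, throughputs):
    """Time per step is set by the slowest worker in a data-parallel step."""
    return max(c / t for c, t in zip(counts, throughputs))


if __name__ == "__main__":
    # Hypothetical rates: one GPU-accelerated process and two spare CPU cores.
    rates = [8.0, 1.0, 1.0]
    counts = partition_elements(1000, rates)
    print(counts, predicted_step_time(counts, rates))
```

Under this model, adding the spare CPU cores shortens a step only by the fraction of total throughput they contribute, which is one way to judge whether measured gains are close to optimal.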
