High order accurate simulation of compressible flows on GPU clusters over Software Distributed Shared Memory

Abstract: The advent of multicore processors during the past decade, and especially the recent introduction of many-core Graphics Processing Units (GPUs), opens new horizons for large-scale, high-resolution simulations across a broad range of scientific fields. At the forefront of advances in multiprocessor technology, GPUs are often chosen as co-processors for the compute-intensive parts of applications. Computational Fluid Dynamics (CFD) is one domain that could benefit significantly from many-core GPUs. To investigate this possibility, we evaluate the performance of a high order accurate method for the simulation of compressible flows. Because we target computer systems with multiple GPUs, the implementation and its performance evaluation are carried out on a GPU cluster. For programming these GPUs, this paper offers an alternative to the mainstream message passing approach by adopting a shared memory abstraction. In the implementations presented here, updates to shared data are not explicitly coded by the programmer across the simulation phases, but are propagated through Software Distributed Shared Memory (SDSM). In this way, we intend to preserve a unified memory view that extends the memory hierarchy from the node level to the cluster level. Such an extension could significantly facilitate the porting of multithreaded codes to GPU clusters. Our results indicate that the presented approach is competitive with the message passing paradigm, and they lay the groundwork for further research on the use of shared memory abstraction for future GPU clusters.
