Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible-flow Navier–Stokes solver for multi-GPU workstation platforms. We adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU, and then use Pthreads to enable communication across multiple GPUs on a workstation. To provide a fair comparison between CPUs and GPUs, we also develop a shared-memory parallel code for multi-core CPUs that uses identical numerical methods. The projection algorithm for the incompressible flow equations is implemented as a set of separate CUDA kernels, and each kernel is mapped to a GPU memory space chosen according to its arithmetic intensity. This memory-hierarchy-specific implementation runs significantly faster than one that does not distinguish between memory spaces. We present a systematic analysis of speedup and scaling on two generations of NVIDIA GPU architectures and compare single- and double-precision computational performance on the GPU. Using a quad-GPU platform for single-precision computations, we observe a speedup of two orders of magnitude relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective, small-footprint parallel computing platform that substantially accelerates computational fluid dynamics (CFD) simulations.
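The one-host-thread-per-GPU pattern described above can be illustrated with a minimal, CPU-only sketch. It is not the authors' solver: a 1D Jacobi relaxation stands in for the flow kernels, each Pthread plays the role of the host thread that would drive one GPU, and a barrier stands in for the ghost-cell exchange between neighbouring subdomains. All names (`worker`, `solve_laplace_1d`, `NUM_WORKERS`, `CELLS_PER_WORKER`) are hypothetical and chosen only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_WORKERS 4        /* assumption: one host thread per GPU on a quad-GPU box */
#define CELLS_PER_WORKER 8   /* assumption: equal-sized slabs */
#define NUM_CELLS (NUM_WORKERS * CELLS_PER_WORKER)

typedef struct {
    int id;                      /* slab index; a CUDA version might pass it to cudaSetDevice */
    int iterations;
    double *buf_a;               /* shared field, NUM_CELLS interior cells + 2 boundary cells */
    double *buf_b;
    pthread_barrier_t *barrier;
} worker_args;

/* Each worker relaxes only its own slab, then waits for all other slabs. */
static void *worker(void *p)
{
    worker_args *a = p;
    double *src = a->buf_a, *dst = a->buf_b;
    int first = 1 + a->id * CELLS_PER_WORKER;
    int last = first + CELLS_PER_WORKER;

    for (int it = 0; it < a->iterations; ++it) {
        /* In a GPU code this loop would be a kernel launch on device `id`. */
        for (int i = first; i < last; ++i)
            dst[i] = 0.5 * (src[i - 1] + src[i + 1]);

        /* Stand-in for the ghost-cell exchange: nobody proceeds until
           every slab of the new field has been written. */
        pthread_barrier_wait(a->barrier);

        double *tmp = src; src = dst; dst = tmp;
    }
    return NULL;
}

/* Relax u'' = 0 with fixed end values; writes NUM_CELLS interior values to out. */
void solve_laplace_1d(int iterations, double left, double right, double *out)
{
    double *a = calloc(NUM_CELLS + 2, sizeof *a);
    double *b = calloc(NUM_CELLS + 2, sizeof *b);
    assert(a != NULL && b != NULL);
    a[0] = b[0] = left;
    a[NUM_CELLS + 1] = b[NUM_CELLS + 1] = right;

    pthread_barrier_t barrier;
    pthread_barrier_init(&barrier, NULL, NUM_WORKERS);

    pthread_t threads[NUM_WORKERS];
    worker_args args[NUM_WORKERS];
    for (int t = 0; t < NUM_WORKERS; ++t) {
        args[t] = (worker_args){ t, iterations, a, b, &barrier };
        int rc = pthread_create(&threads[t], NULL, worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_WORKERS; ++t)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&barrier);

    /* Buffers swap once per iteration, so the newest field is in a for an
       even iteration count and in b for an odd one. */
    const double *result = (iterations % 2 == 0) ? a : b;
    memcpy(out, result + 1, NUM_CELLS * sizeof *out);
    free(a);
    free(b);
}
```

Calling `solve_laplace_1d(10000, 0.0, 1.0, u)` on an array `u` of `NUM_CELLS` doubles relaxes the field toward the linear profile between the two boundary values. In an actual multi-GPU code the shared buffers would be replaced by per-device arrays, and the barrier step would presumably be accompanied by copies of boundary layers through host memory; those details are assumptions here, not taken from the paper.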
