Parallel Thomas approach development for solving tridiagonal systems in GPU programming − steady and unsteady flow simulation

The solution of tridiagonal systems of equations on graphics processing units (GPUs) is assessed. A parallel Thomas algorithm (PTA) is developed, and its solution is compared with those of two known parallel algorithms, cyclic reduction (CR) and parallel cyclic reduction (PCR). The lid-driven cavity problem is used to assess these parallel approaches. The same problem is also simulated with the classic Thomas algorithm running on a central processing unit (CPU). Runtimes and physical parameters of the GPU and CPU algorithms are compared. The results show that the speedups of CR, PCR and PTA over the CPU runtime are 4.4x, 5.2x and 38.5x, respectively. Furthermore, the effect of coalesced versus uncoalesced access to GPU global memory is examined for PTA, and coalesced access yields a 2x speedup. Additionally, the PTA performance is assessed for a time-dependent problem, the unsteady flow over a square, and a 9x speedup over the CPU is obtained.
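For reference, the classic Thomas algorithm used as the CPU baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `thomas_solve` and `solve_batch` and the list-based storage are assumptions made for the example. The abstract does not describe how PTA distributes work; the batch loop below only illustrates one common arrangement, in which each independent system (for example, one grid line) would be handled by its own GPU thread.

```python
def thomas_solve(a, b, c, d):
    """Solve one tridiagonal system with the classic Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    Assumes the system needs no pivoting (e.g. it is diagonally dominant).
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Solve many independent systems one after another.

    Hypothetical helper: on a GPU, each iteration of this loop could be
    assigned to a separate thread, since the systems do not depend on
    each other.
    """
    return [thomas_solve(a, b, c, d) for (a, b, c, d) in systems]
```

Each solve is sequential in the row index, which is why reduction-based schemes such as CR and PCR are usually considered for parallelizing a single system, while a per-system Thomas solve only pays off when many independent systems are available at once.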
