Batched transpose-free ADI-type preconditioners for a Poisson solver on GPGPUs

Abstract. We investigate the iterative solution of a symmetric positive definite linear system whose system matrix is the shifted Laplacian on General Purpose Graphics Processing Units (GPGPUs). In particular, we consider the Chebyshev iteration because of its reduced global communication. The ADI-type preconditioner requires solving multiple (batched) symmetric positive definite tridiagonal Toeplitz systems along each coordinate direction. We investigate several variants for solving these tridiagonal systems: the Thomas algorithm, the Thomas algorithm combined with the SPIKE algorithm, and a polynomial approximation of the inverse. We test the various implementations numerically on two- and three-dimensional examples. It turns out that a combination of the Thomas algorithm and the approximate inverse leads to a solution that needs neither tiling nor transpositions. As a result, none of the kernels uses a large amount of shared memory, which yields very high GPU utilization and, more importantly, optimal coalesced global memory access patterns.
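To make the batched tridiagonal solve concrete, the following is a minimal sketch of the Thomas algorithm applied to a batch of right-hand sides that share one symmetric tridiagonal Toeplitz matrix. It is plain sequential Python for illustration only, not the GPU kernels studied in the paper. The function names `thomas_toeplitz_batched` and `toeplitz_matvec`, and the parameters `diag` and `off` (the constant diagonal and off-diagonal entries), are assumptions introduced here.

```python
def toeplitz_matvec(diag, off, x):
    """Multiply T = tridiag(off, diag, off) by the vector x."""
    n = len(x)
    y = [diag * x[i] for i in range(n)]
    for i in range(n):
        if i > 0:
            y[i] += off * x[i - 1]
        if i < n - 1:
            y[i] += off * x[i + 1]
    return y


def thomas_toeplitz_batched(diag, off, rhs_batch):
    """Solve T x = d for every d in rhs_batch, with T = tridiag(off, diag, off).

    Assumes T is symmetric positive definite, so no pivoting is needed.
    """
    n = len(rhs_batch[0])

    # The elimination coefficients depend only on the matrix, so they are
    # computed once and reused for every right-hand side in the batch.
    piv = [0.0] * n   # pivots (diagonal of U)
    mult = [0.0] * n  # multipliers (sub-diagonal of L), mult[0] is unused
    piv[0] = diag
    for i in range(1, n):
        mult[i] = off / piv[i - 1]
        piv[i] = diag - mult[i] * off

    solutions = []
    for d in rhs_batch:
        # Forward substitution: L y = d.
        y = list(d)
        for i in range(1, n):
            y[i] -= mult[i] * y[i - 1]
        # Backward substitution: U x = y.
        x = [0.0] * n
        x[n - 1] = y[n - 1] / piv[n - 1]
        for i in range(n - 2, -1, -1):
            x[i] = (y[i] - off * x[i + 1]) / piv[i]
        solutions.append(x)
    return solutions
```

A possible use is to build right-hand sides from known vectors with `toeplitz_matvec(4.0, -1.0, x)` and check that `thomas_toeplitz_batched(4.0, -1.0, [...])` recovers them. Because the matrix is Toeplitz and identical across the batch, the factorization is shared; on a GPU, each right-hand side would typically be assigned to its own thread, which is a design assumption of this sketch rather than a detail taken from the abstract.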
