NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems: Implementation of cuThomasBatch

Solving tridiagonal systems is one of the most computationally expensive parts of many applications, and multiple studies have therefore explored the use of NVIDIA GPUs to accelerate this computation. However, these studies have mainly focused on parallel algorithms, which exploit shared memory efficiently and can saturate the GPU's capacity with a small number of systems, but scale poorly when the number of systems is relatively high. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. To scale well with this approach, it is necessary to transform the way the inputs are stored in memory so as to exploit coalescence (contiguous threads accessing contiguous memory locations). The results of this study show that the implementation outperforms the reference code when dealing with a relatively large number of tridiagonal systems (2,000–256,000), being close to \(3\times\) (in double precision) and \(4\times\) (in single precision) faster on one Kepler NVIDIA GPU.
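As an illustration of the idea described above, the following is a minimal CPU sketch in Python, not the cuThomasBatch kernel itself. It assumes an interleaved layout in which element `i` of system `s` is stored at index `i * n_sys + s`, so that on a GPU, consecutive threads (one per system) would touch consecutive memory locations at every step of the Thomas algorithm. The function names `interleave` and `thomas_interleaved`, and the exact layout, are assumptions made for this example.

```python
def interleave(flat, n_sys, n):
    """Convert system-major storage (index s * n + i) into the
    assumed interleaved storage (index i * n_sys + s)."""
    out = [0.0] * (n_sys * n)
    for s in range(n_sys):
        for i in range(n):
            out[i * n_sys + s] = flat[s * n + i]
    return out


def thomas_interleaved(a, b, c, d, n_sys, n):
    """Solve n_sys independent tridiagonal systems of size n.

    a, b, c are the lower, main, and upper diagonals and d the
    right-hand sides, all in the interleaved layout. Row i of a
    system reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
    No pivoting is performed, so the sketch assumes well-conditioned
    (e.g. diagonally dominant) systems.
    """
    x = list(d)                    # work on a copy of the right-hand sides
    cp = [0.0] * (n_sys * n)       # modified upper diagonal

    # On a GPU, each iteration of this loop would be one thread.
    for s in range(n_sys):
        # Forward elimination.
        k = s
        cp[k] = c[k] / b[k]
        x[k] = x[k] / b[k]
        for i in range(1, n):
            k = i * n_sys + s
            prev = k - n_sys       # same system, previous row
            m = b[k] - a[k] * cp[prev]
            cp[k] = c[k] / m
            x[k] = (x[k] - a[k] * x[prev]) / m

        # Back substitution.
        for i in range(n - 2, -1, -1):
            k = i * n_sys + s
            x[k] -= cp[k] * x[k + n_sys]
    return x
```

For a fixed row `i`, the indices touched by systems `s = 0, 1, 2, ...` are adjacent (`i * n_sys + s`), which is the access pattern that coalescence relies on. In system-major storage the same accesses would be `n` elements apart.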
