An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU

We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously large tridiagonal systems cannot be efficiently solved due to the limitation of on-chip shared memory size. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage characteristic of our method, together with various workloads and GPUs of different capabilities, obligates an auto-tuning strategy to carefully select the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool that improves the performance by up to 5x, saves 17% and 32% of execution time on average respectively over static and dynamic tuning, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver on many parallel tridiagonal systems by 6-11x.

[1]  G. Halliwell,et al.  Evaluation of vertical coordinate and vertical mixing algorithms in the HYbrid-Coordinate Ocean Model (HYCOM) , 2004 .

[2]  Roger W. Hockney,et al.  A Fast Direct Solution of Poisson's Equation Using Fourier Analysis , 1965, JACM.

[3]  Kevin Skadron,et al.  Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[4]  Xipeng Shen,et al.  A cross-input adaptive framework for GPU program optimizations , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[5]  S. Lennart Johnsson,et al.  Optimizing Tridiagonal Solvers for Alternating Direction Methods on Boolean Cube Multiprocessors , 1989, SIAM J. Sci. Comput..

[6]  Yao Zhang,et al.  Fast tridiagonal solvers on the GPU , 2010, PPoPP '10.

[7]  Richard W. Vuduc,et al.  Model-driven autotuning of sparse matrix-vector multiply on GPUs , 2010, PPoPP '10.

[8]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[9]  Torben Hagerup,et al.  Optimal Merging and Sorting on the Erew Pram , 1989, Inf. Process. Lett..

[10]  William J. Dally,et al.  The GPU Computing Era , 2010, IEEE Micro.

[11]  Anne Greenbaum,et al.  Iterative methods for solving linear systems , 1997, Frontiers in applied mathematics.

[12]  Uday Bondhugula,et al.  A compiler framework for optimization of affine loop nests for gpgpus , 2008, ICS '08.

[13]  Robert Strzodka,et al.  Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid , 2011, IEEE Transactions on Parallel and Distributed Systems.

[14]  Chris R. Jesshope,et al.  Parallel Computers 2: Architecture, Programming and Algorithms , 1981 .

[15]  Wen-mei W. Hwu,et al.  Program optimization carving for GPU computing , 2008, J. Parallel Distributed Comput..

[16]  Daniel Egloff High performance finite difference PDE solvers on GPUs , 2011 .