论文信息 - An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU

An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU

We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously large tridiagonal systems cannot be efficiently solved due to the limitation of on-chip shared memory size. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage characteristic of our method, together with various workloads and GPUs of different capabilities, obligates an auto-tuning strategy to carefully select the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool that improves the performance by up to 5x, saves 17% and 32% of execution time on average respectively over static and dynamic tuning, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver on many parallel tridiagonal systems by 6-11x.

[1] G. Halliwell,et al. Evaluation of vertical coordinate and vertical mixing algorithms in the HYbrid-Coordinate Ocean Model (HYCOM) , 2004 .

[2] Roger W. Hockney,et al. A Fast Direct Solution of Poisson's Equation Using Fourier Analysis , 1965, JACM.

[3] Kevin Skadron,et al. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[4] Xipeng Shen,et al. A cross-input adaptive framework for GPU program optimizations , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[5] S. Lennart Johnsson,et al. Optimizing Tridiagonal Solvers for Alternating Direction Methods on Boolean Cube Multiprocessors , 1989, SIAM J. Sci. Comput..

[6] Yao Zhang,et al. Fast tridiagonal solvers on the GPU , 2010, PPoPP '10.

[7] Richard W. Vuduc,et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs , 2010, PPoPP '10.

[8] Alan Edelman,et al. PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[9] Torben Hagerup,et al. Optimal Merging and Sorting on the Erew Pram , 1989, Inf. Process. Lett..

[10] William J. Dally,et al. The GPU Computing Era , 2010, IEEE Micro.

[11] Anne Greenbaum,et al. Iterative methods for solving linear systems , 1997, Frontiers in applied mathematics.

[12] Uday Bondhugula,et al. A compiler framework for optimization of affine loop nests for gpgpus , 2008, ICS '08.

[13] Robert Strzodka,et al. Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid , 2011, IEEE Transactions on Parallel and Distributed Systems.

[14] Chris R. Jesshope,et al. Parallel Computers 2: Architecture, Programming and Algorithms , 1981 .

[15] Wen-mei W. Hwu,et al. Program optimization carving for GPU computing , 2008, J. Parallel Distributed Comput..

[16] Daniel Egloff. High performance finite difference PDE solvers on GPUs , 2011 .