GPU Implementation of Finite Difference Solvers

This paper discusses the implementation of one-factor and three-factor PDE models on GPUs. Both explicit and implicit time-marching methods are considered, with the latter requiring the solution of multiple tridiagonal systems of equations.

Because of the small amount of data involved, one-factor models are primarily compute-limited, with a very good fraction of the peak compute capability being achieved. The key to the performance lies in the heavy use of registers and shuffle instructions for the explicit method, and a non-standard hybrid Thomas/PCR algorithm for solving the tridiagonal systems for the implicit solver.

The three-factor problems involve much more data, and hence their execution is more evenly balanced between computation and data communication to/from the main graphics memory. However, it is again possible to achieve a good fraction of the theoretical peak performance on both measures. The high performance requires particularly careful attention to coalescence in the data transfers, using local shared memory for small array transpositions, and padding to avoid shared memory bank conflicts.

Computational results include comparisons to computations on Sandy Bridge and Haswell Intel Xeon processors, using both multithreading and AVX vectorisation.
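The abstract names a hybrid Thomas/PCR algorithm for the tridiagonal solves but does not spell out how the two are combined. As a rough illustration only, the sketch below shows the two textbook building blocks separately, in plain Python rather than as a GPU kernel: the sequential Thomas algorithm, and parallel cyclic reduction (PCR), where every row update within a step is independent and could be assigned to its own thread. The function names `thomas` and `pcr` and the array convention (`a` sub-diagonal with `a[0] = 0`, `b` diagonal, `c` super-diagonal with `c[-1] = 0`, `d` right-hand side) are assumptions made for this example, and no pivoting is done, so a diagonally dominant system is assumed.

```python
def thomas(a, b, c, d):
    """Sequential Thomas algorithm: forward elimination, then back substitution."""
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def pcr(a, b, c, d):
    """Parallel cyclic reduction, written as a serial loop over rows.

    After a step with a given stride, row i couples only to rows
    i - 2*stride and i + 2*stride. Once the stride reaches n, every
    row is decoupled and x[i] = d[i] / b[i].
    """
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
        for i in range(n):  # iterations are independent (one thread per row)
            lo, hi = i - stride, i + stride
            if lo >= 0:
                # Eliminate x[lo] from row i using row lo.
                alpha = -a[i] / b[lo]
                na[i] = alpha * a[lo]
                nb[i] += alpha * c[lo]
                nd[i] += alpha * d[lo]
            if hi < n:
                # Eliminate x[hi] from row i using row hi.
                gamma = -c[i] / b[hi]
                nc[i] = gamma * c[hi]
                nb[i] += gamma * a[hi]
                nd[i] += gamma * d[hi]
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    return [d[i] / b[i] for i in range(n)]


def example_system(n):
    """Hypothetical test system: diagonal 4, off-diagonals -1, known solution 1..n."""
    a = [0.0] + [-1.0] * (n - 1)
    b = [4.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    x = [float(i + 1) for i in range(n)]
    d = [
        (a[i] * x[i - 1] if i > 0 else 0.0)
        + b[i] * x[i]
        + (c[i] * x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]
    return a, b, c, d, x
```

Thomas does O(n) work but is inherently sequential along a single system, while PCR does O(n log n) work in about log2(n) fully parallel steps. That trade-off is one plausible motivation for combining the two on a GPU, although the specific combination used in the paper is not described in this abstract.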
