Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Fast, robust and efficient multigrid solvers are a key numerical tool in the solution of partial differential equations discretised with finite elements. The vast majority of practical simulation scenarios require that the underlying grid is unstructured, and that high-order discretisations are used. On the other hand, hardware is quickly evolving towards parallelism and heterogeneity, even within a single workstation. Commodity CPUs have multiple cores, and GPUs are the most prominent example of current fine-grained parallel architectures. We are convinced that geometric multigrid methods are superior to algebraic multigrid methods, if their components are designed with respect to the underlying finite element discretisation. Such an approach, which we call finite element geometric multigrid (FE-GMG), allows the design and development of numerically optimal solvers. While many multigrid components can be parallelised in a straightforward manner, two components pose severe challenges: robust and strong smoothers are inherently recursive and sequential, and grid transfer operations (prolongation and restriction) have to be re-formulated for the chosen finite element space and mesh hierarchy. Our approach follows the hardware-oriented numerics paradigm, and we aim at simultaneously maximising numerical and computational efficiency. In this paper, we tackle the second problem and evaluate an implementation technique for geometric multigrid solvers that is based completely on sequences of sparse matrix-vector multiplications. With no loss in performance and only moderately increased memory requirements, this approach allows us to design a multigrid solver that is oblivious to the spatial dimension of the computational domain, the underlying unstructured discretisation grid, and the chosen finite element space.
We are thus the first to assemble competitive geometric multigrid solvers for finite element discretisations on unstructured grids that execute very efficiently on both CPUs and GPUs. Our numerical evaluation shows that an FE-GMG solver assembled completely from sequences of sparse matrix-vector kernels is able to exploit the parallelism provided by multicore CPUs and GPUs: we gain a speedup of two to three when doubling the number of memory controllers of the CPU, and a further factor of 4 to 14 (8 on average) when switching from a multicore CPU to the GPU. Additionally, we show that the numbering of the degrees of freedom can have a huge impact on performance, up to a factor of 25 on both architectures.
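To illustrate the idea of a multigrid cycle built only from sparse matrix-vector products, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a 1D Poisson problem with linear finite elements on a uniform mesh, a damped Jacobi smoother, and plain Python lists in CSR layout; all function names (`csr`, `spmv`, `build_level`, `jacobi`, `vcycle`, `solve`) are hypothetical. Smoothing, residual computation, restriction and prolongation each reduce to calls of the same `spmv` kernel, so only the matrices would change for another mesh, dimension or element type.

```python
import math

def csr(rows):
    """Build a CSR triple (values, column indices, row pointers) from per-row {col: value} dicts."""
    vals, cols, ptr = [], [], [0]
    for row in rows:
        for c in sorted(row):
            cols.append(c)
            vals.append(row[c])
        ptr.append(len(vals))
    return vals, cols, ptr

def spmv(A, x):
    """Sparse matrix-vector product y = A x, the only numerical kernel used below."""
    vals, cols, ptr = A
    return [sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1]))
            for i in range(len(ptr) - 1)]

def build_level(n):
    """Assemble stiffness matrix, inverse diagonal, prolongation and restriction for n unknowns."""
    h = 1.0 / (n + 1)
    A = csr([{j: (2.0 if j == i else -1.0) / h for j in (i - 1, i, i + 1) if 0 <= j < n}
             for i in range(n)])
    Dinv = csr([{i: h / 2.0} for i in range(n)])   # Jacobi preconditioner as a sparse matrix
    m = (n - 1) // 2                               # unknowns on the next coarser level
    R = csr([{2 * j: 0.5, 2 * j + 1: 1.0, 2 * j + 2: 0.5} for j in range(m)])
    p_rows = [{} for _ in range(n)]                # P is the transpose of R
    for j in range(m):
        for i, w in ((2 * j, 0.5), (2 * j + 1, 1.0), (2 * j + 2, 0.5)):
            p_rows[i][j] = w
    return A, Dinv, csr(p_rows), R

def residual(A, b, x):
    return [bi - ai for bi, ai in zip(b, spmv(A, x))]

def jacobi(A, Dinv, b, x, omega=2.0 / 3.0, sweeps=2):
    """Damped Jacobi: x <- x + omega * Dinv * (b - A x), two SpMVs per sweep."""
    for _ in range(sweeps):
        d = spmv(Dinv, residual(A, b, x))
        x = [xi + omega * di for xi, di in zip(x, d)]
    return x

def vcycle(levels, l, b, x):
    A, Dinv, P, R = levels[l]
    if l == 0:
        return spmv(Dinv, b)                       # one unknown: the diagonal inverse is exact
    x = jacobi(A, Dinv, b, x)                      # pre-smoothing
    rc = spmv(R, residual(A, b, x))                # restrict the residual
    ec = vcycle(levels, l - 1, rc, [0.0] * len(rc))
    x = [xi + ei for xi, ei in zip(x, spmv(P, ec))]  # prolongate and correct
    return jacobi(A, Dinv, b, x)                   # post-smoothing

def solve(num_levels=6, cycles=10):
    """Solve -u'' = 1 on (0, 1) with zero boundary values; return solution and residual norms."""
    levels = [build_level(2 ** (l + 1) - 1) for l in range(num_levels)]
    A = levels[-1][0]
    n = 2 ** num_levels - 1
    b = [1.0 / (n + 1)] * n                        # load vector for f = 1
    x = [0.0] * n
    norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
    history = [norm(b)]
    for _ in range(cycles):
        x = vcycle(levels, num_levels - 1, b, x)
        history.append(norm(residual(A, b, x)))
    return x, history

if __name__ == "__main__":
    x, history = solve()
    print("residual reduction:", history[-1] / history[0])
```

In this sketch the coarse matrices are re-assembled on each level; for this model problem they coincide with the Galerkin products of restriction, fine matrix and prolongation. A GPU port would, under the same assumptions, only need to replace `spmv` with a device kernel.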
