Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs -- The Power(q)-pattern Method

Application demands and grand challenges in numerical simulation require both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way to multi- and manycore processors with increasingly fine-grained parallelism and deeply nested hierarchical memory systems, as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge. This work investigates multi-coloring and re-ordering schemes for block Gauss-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are themselves diagonal. In this way, parallelism can be exploited on the block level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers. We propose a new method for anticipating the fill-in pattern of ILU($p$) schemes, which we call the power($q$)-pattern method. This method is based on an incomplete factorization of the system matrix $A$ subject to a predetermined pattern given by the matrix power $|A|^{p+1}$ and its associated multi-coloring permutation $\pi$. We prove that the obtained sparsity pattern is a superset of the pattern of our modified ILU($p$) factorization applied to $\pi A \pi^{-1}$. As a result, this modified ILU($p$) applied to the multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of triangular ILU($p$) sweeps.
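The pattern construction described above can be illustrated with a minimal sketch. It assumes a sparsity pattern stored as one set of column indices per row and uses a simple greedy coloring. The function names (`pattern_power`, `greedy_coloring`, `blocks_are_diagonal`) are hypothetical and the greedy strategy is an assumption, not necessarily the coloring algorithm used in this work. The sketch only builds the pattern of $|A|^{p+1}$ and the permutation $\pi$; it does not perform the factorization itself.

```python
def pattern_power(pattern, k):
    """Sparsity pattern of |A|^k, given the pattern of A as a list of
    column-index sets (one set per row). Since |A| has no negative
    entries, no cancellation can occur, so set unions are exact."""
    result = [set(row) for row in pattern]
    for _ in range(k - 1):
        # Row i of |A|^(m+1) is the union of rows j of |A| over j in row i of |A|^m.
        result = [set().union(*(pattern[j] for j in row)) for row in result]
    return result


def greedy_coloring(pattern):
    """Greedy coloring of the (symmetrized) adjacency graph of a pattern.
    Assumption: plain first-fit coloring in natural index order."""
    n = len(pattern)
    adj = [set() for _ in range(n)]
    for i, row in enumerate(pattern):
        for j in row:
            if i != j:
                adj[i].add(j)
                adj[j].add(i)
    color = [-1] * n
    for i in range(n):
        used = {color[j] for j in adj[i] if color[j] >= 0}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    return color


def blocks_are_diagonal(pattern, color):
    """True if no two distinct unknowns of the same color are coupled,
    i.e. every diagonal block of the permuted pattern is diagonal."""
    return all(color[i] != color[j]
               for i, row in enumerate(pattern) for j in row if i != j)


# Example: tridiagonal pattern of a 1D Laplacian with n = 6 and p = 1.
n, p = 6, 1
A = [{j for j in (i - 1, i, i + 1) if 0 <= j < n} for i in range(n)]
P = pattern_power(A, p + 1)                  # pattern of |A|^(p+1)
colors = greedy_coloring(P)                  # multi-coloring of that pattern
perm = sorted(range(n), key=lambda i: colors[i])  # permutation grouping by color
print(colors, perm, blocks_are_diagonal(P, colors))
```

Because unknowns of equal color are mutually uncoupled in the pattern of $|A|^{p+1}$, each diagonal block of the permuted pattern is diagonal, which is what allows the unknowns within one block to be processed in parallel during the triangular sweeps.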
In addition, we describe the integration of the preconditioners into the HiFlow$^3$ open-source finite element package, which provides a portable software solution across diverse hardware platforms. On this basis, we conduct a performance analysis across a variety of test problems on multi-core CPUs and GPUs that demonstrates the efficiency, scalability and flexibility of our approach. Our preconditioners accelerate the solver by factors of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are up to 4 times faster than an OpenMP-parallel version on eight cores.
