A fine-grained block ILU scheme on regular structures for GPGPUs

Abstract: Iterative methods based on block incomplete LU (BILU) factorization are considered highly effective for solving the large-scale block-sparse linear systems that arise from coupled PDE systems with $n$ equations. However, efforts to port implicit PDE solvers to massively parallel shared-memory heterogeneous architectures, such as general-purpose graphics processing units (GPGPUs), have largely avoided BILU. This leaves the performance potential of these architectures unfulfilled in the many applications where implicit schemes and BILU-type preconditioners or solvers are strongly preferred. Indeed, the strong inherent data dependency and the high memory bandwidth demanded by block matrix operations make a naive adoption of existing sequential BILU algorithms extremely inefficient on GPGPUs. In this study, we present a fine-grained BILU (FGBILU) scheme that is particularly effective on GPGPUs. A straightforward one-sweep wavefront ordering is employed to resolve the data dependency. Granularity is substantially refined by carrying out block matrix operations in a true element-wise manner. In particular, the inversion of diagonal blocks, a well-known bottleneck, is accomplished by a parallel in-place Gauss–Jordan elimination. As a result, FGBILU offers low-overhead concurrent computation at the $O(n^2 N^2)$ scale on a 3D PDE domain with linear dimension $N$. FGBILU has been implemented with both OpenACC and CUDA and tested as a block-sparse linear solver on a structured 3D grid. FGBILU remains mathematically identical to sequential global BILU, and numerical experiments confirm its exceptional performance on an Nvidia GPGPU.
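The two ingredients named in the abstract, wavefront ordering and element-wise in-place Gauss–Jordan inversion, can be illustrated with a minimal serial sketch. It is not the authors' implementation. The function names `invert_block_in_place` and `wavefronts` are hypothetical, the inversion does no pivoting, and the wavefronts are assumed to be the hyperplanes i + j + k = const, which suits a nearest-neighbor stencil on a structured grid.

```python
def invert_block_in_place(a):
    """Invert a small dense n x n block in place by Gauss-Jordan elimination.

    Assumption: no pivoting, so every pivot must be nonzero.
    Within one pivot step k, each element (i, j) is updated only from a
    snapshot of the pivot row and pivot column, so all n*n updates are
    independent of one another (the element-wise granularity).
    """
    n = len(a)
    for k in range(n):
        pivot = a[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        p = 1.0 / pivot
        col = [a[i][k] for i in range(n)]  # snapshot of the pivot column
        row = a[k][:]                      # snapshot of the pivot row
        for i in range(n):
            for j in range(n):
                if i == k and j == k:
                    a[i][j] = p
                elif i == k:
                    a[i][j] = row[j] * p
                elif j == k:
                    a[i][j] = -col[i] * p
                else:
                    a[i][j] -= col[i] * row[j] * p
    return a


def wavefronts(N):
    """Group the cells of an N x N x N grid into hyperplanes i + j + k = w.

    Assumption: a cell depends only on neighbors with smaller i, j, or k,
    so all cells within one wavefront can be processed concurrently.
    """
    fronts = [[] for _ in range(3 * N - 2)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                fronts[i + j + k].append((i, j, k))
    return fronts


if __name__ == "__main__":
    block = [[4.0, 3.0], [6.0, 3.0]]
    invert_block_in_place(block)
    print(block)
    print([len(front) for front in wavefronts(3)])
```

In this sketch the two inner loops over (i, j) are the part that would map to one thread per element, with a barrier between pivot steps. The largest wavefront holds on the order of $N^2$ cells, each carrying an $n \times n$ block, which is consistent with the $O(n^2 N^2)$ concurrency quoted in the abstract.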
