Petascale solvers for anisotropic PDEs in atmospheric modelling on GPU clusters

Multi-GPU parallelisation of scalable CG and multigrid solvers for anisotropic PDEs.
Efficient matrix-free CUDA implementation minimises global memory access on GPUs.
Excellent weak scaling on up to 16,384 NVIDIA K20X cards (44 million cores, Titan, OLCF).
PDEs with 5 × 10¹¹ unknowns can be solved in 1 s to an accuracy of 10⁻⁵.
Achieved performance of 0.78 PFLOPs and memory bandwidth utilisation of more than 40%.

Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with trillions (O(10¹²)) of unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters.

We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic PDEs which are encountered in (semi-)implicit time stepping procedures in atmospheric modelling. In this application the condition number is large but independent of the grid resolution, and both methods are asymptotically optimal, albeit with different absolute performance. In particular, an important constant in the discretisation is the CFL number; only the multigrid solver is robust to changes in this constant. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. With a Conjugate Gradient (CG) solver we achieve a performance of up to 0.78 PFLOPs when solving an equation with 0.55 × 10¹² unknowns on 16,384 GPUs; this corresponds to about 3% of the theoretical peak performance of the machine and to more than 40% of the peak memory bandwidth. Although the other solver, a geometric multigrid algorithm, has a slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.
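To illustrate what "matrix-free" means for a CG solver, the following is a minimal single-process Python sketch, not the CUDA implementation described above. It assumes a hypothetical model operator A u = u + α_h (−∂²u/∂x²) + α_v (−∂²u/∂z²) on a small two-dimensional grid with unit spacing and zero values outside the grid, with α_v much larger than α_h to mimic strong vertical coupling. The names `apply_operator`, `conjugate_gradient`, `alpha_h` and `alpha_v` are illustrative assumptions, and no preconditioner is used. The point is that the stencil is evaluated on the fly, so no matrix is ever stored or read from memory.

```python
def apply_operator(u, nx, nz, alpha_h, alpha_v):
    """Matrix-free y = A u for an assumed model anisotropic operator.

    The grid is stored column-major in a flat list (index = i * nz + k);
    neighbours outside the grid are treated as zero.
    """
    y = [0.0] * (nx * nz)
    for i in range(nx):
        for k in range(nz):
            idx = i * nz + k
            c = u[idx]
            west = u[idx - nz] if i > 0 else 0.0
            east = u[idx + nz] if i < nx - 1 else 0.0
            down = u[idx - 1] if k > 0 else 0.0
            up = u[idx + 1] if k < nz - 1 else 0.0
            y[idx] = (c
                      + alpha_h * (2.0 * c - west - east)
                      + alpha_v * (2.0 * c - down - up))
    return y


def dot(a, b):
    """Euclidean inner product of two flat lists."""
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_a, b, tol=1e-5, max_iter=1000):
    """Unpreconditioned CG; stops when ||r|| <= tol * ||b||."""
    x = [0.0] * len(b)
    r = list(b)          # residual for the zero initial guess
    p = list(r)          # first search direction
    rr = dot(r, r)
    r0_norm = rr ** 0.5
    if r0_norm == 0.0:
        return x, 0
    for iteration in range(1, max_iter + 1):
        ap = apply_a(p)  # the only place the operator is needed
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        if rr_new ** 0.5 <= tol * r0_norm:
            return x, iteration
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


# Small illustrative problem: 16 columns, 8 vertical levels.
nx, nz = 16, 8
alpha_h, alpha_v = 1.0, 100.0   # assumed values; vertical coupling dominates


def apply_a(v):
    return apply_operator(v, nx, nz, alpha_h, alpha_v)


rhs = [1.0] * (nx * nz)
solution, iterations = conjugate_gradient(apply_a, rhs, tol=1e-5)
```

In a GPU version of such a sketch, the double loop in `apply_operator` would typically become one thread per grid column, which is one way to keep global memory reads contiguous; how the actual implementation lays out its data is not specified in the text above.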
