Improving strong scaling of the Conjugate Gradient method for solving large linear systems using global reduction pipelining

This paper presents performance results comparing MPI-based implementations of the popular Conjugate Gradient (CG) method and several of its communication-hiding (or 'pipelined') variants. Pipelined CG methods are designed to efficiently solve symmetric positive definite (SPD) linear systems on massively parallel distributed-memory hardware, and typically display significantly improved strong scaling compared to classic CG. This increase in parallel performance is achieved by overlapping the global reduction phase (MPI_Iallreduce) required to compute the inner products in each iteration with (chiefly local) computational work, such as the matrix-vector product, as well as with other global communication. This work includes a brief introduction to the deep pipelined CG method, p(l)-CG, for readers who may be unfamiliar with its specifics. A brief overview of implementation details provides the practical tools required to implement the algorithm. Subsequently, easily reproducible strong scaling results on the US Department of Energy (DOE) NERSC machine 'Cori' (Phase I, Haswell nodes) on up to 1024 nodes with 16 MPI ranks per node are presented, using an implementation of p(l)-CG that is available in the open-source PETSc library. Observations on the staggering and overlap of the asynchronous, non-blocking global communication phases with communication and computational kernels are drawn from the experiments.
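To make the overlap idea concrete, the following is a minimal serial sketch of the recurrences of unpreconditioned pipelined CG with pipeline depth one. It is not the deep pipelined p(l)-CG method and not the PETSc implementation. The names `matvec`, `dot` and `pipelined_cg`, the 1D Laplacian test operator, and the stopping rule are assumptions chosen for illustration. No MPI is used; comments mark where a distributed code would plausibly post the non-blocking reduction and where it would wait for it.

```python
def matvec(v):
    """Apply the 1D Laplacian stencil (-1, 2, -1); stands in for a distributed SpMV."""
    n = len(v)
    return [
        2.0 * v[i]
        - (v[i - 1] if i > 0 else 0.0)
        - (v[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def dot(u, v):
    """Local inner product; in a distributed code this requires a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def pipelined_cg(b, tol=1e-10, max_iter=500):
    """Depth-one pipelined CG for A x = b with x0 = 0 (serial illustration)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)            # r0 = b - A x0 = b
    w = matvec(r)          # w0 = A r0
    z = [0.0] * n
    s = [0.0] * n
    p = [0.0] * n
    gamma_old = alpha_old = None
    norm_b = dot(b, b) ** 0.5

    for it in range(max_iter):
        # Both inner products depend only on vectors that are already available,
        # so an MPI code could post ONE non-blocking reduction for them here.
        gamma = dot(r, r)
        delta = dot(w, r)

        # This matrix-vector product does not need gamma or delta, so it can
        # run while the reduction is in flight.
        q = matvec(w)

        # An MPI code would wait for the reduction here, before using the scalars.
        if gamma ** 0.5 <= tol * norm_b:
            return x, it

        if it == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)

        # Recurrences for the auxiliary vectors z = A s, s = A p, and for p.
        z = [qi + beta * zi for qi, zi in zip(q, z)]
        s = [wi + beta * si for wi, si in zip(w, s)]
        p = [ri + beta * pi for ri, pi in zip(r, p)]

        # Updates of the solution, the residual, and w = A r.
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = [wi - alpha * zi for wi, zi in zip(w, z)]

        gamma_old, alpha_old = gamma, alpha

    return x, max_iter


if __name__ == "__main__":
    rhs = [1.0] * 30
    solution, iterations = pipelined_cg(rhs)
    residual = max(abs(bi - ai) for bi, ai in zip(rhs, matvec(solution)))
    print("iterations:", iterations, "max true residual:", residual)
```

In classic CG the two inner products sit at separate points in the iteration and each one blocks on its own reduction. The reordering above groups them into a single reduction that can be hidden behind the matrix-vector product, at the cost of extra vector recurrences and the additional auxiliary vectors `w`, `z`, `s` and `q`.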
