Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives

We compare the performance of pipelined and s-step GMRES, referred to as l-GMRES and s-GMRES respectively, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of the two algorithms, we propose another variant, (l, t)-GMRES, that not only performs fewer global all-reduces than standard GMRES but also overlaps those all-reduces with other work. We implemented the thread parallelism and communication overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67×. In addition, although we used only 50 nodes, our variant performed up to 1.22× better than s-GMRES when the latency cost became significant, by hiding the all-reduces.
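To illustrate the overlap idea behind the nonblocking-collective implementation, here is a minimal sketch (not the authors' code) of hiding a global all-reduce behind local work using MPI-3's MPI_Iallreduce. The vector length, the `local_spmv` stand-in, and all variable names are illustrative assumptions; the point is only that the dot-product reduction is started, the next local computation proceeds, and the result is awaited afterwards.

```c
/* Minimal sketch: overlapping a global all-reduce with local computation,
 * the core idea of pipelined GMRES. Requires MPI-3 nonblocking collectives.
 * N_LOCAL and local_spmv() are placeholders, not the paper's kernels. */
#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 1000  /* local vector length per rank (illustrative) */

/* Stand-in for the local part of a sparse matrix-vector product. */
static void local_spmv(const double *x, double *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = 2.0 * x[i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double v[N_LOCAL], w[N_LOCAL];
    for (int i = 0; i < N_LOCAL; ++i) v[i] = 1.0;

    /* Local contribution to the dot product <v, v>. */
    double local_dot = 0.0, global_dot = 0.0;
    for (int i = 0; i < N_LOCAL; ++i) local_dot += v[i] * v[i];

    /* Start the global reduction without blocking ... */
    MPI_Request req;
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... and overlap it with the next SpMV (the "pipelining" step). */
    local_spmv(v, w, N_LOCAL);

    /* Only now wait for the reduction result before it is needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("global <v,v> = %g\n", global_dot);

    MPI_Finalize();
    return 0;
}
```

A thread-parallel kernel or a task scheduler can play the role of `local_spmv` here; the design choice in either case is that useful work fills the latency of the collective rather than the iteration stalling on it.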
