Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives

We compare the performance of pipelined and s-step GMRES, referred to as l-GMRES and s-GMRES respectively, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of the two algorithms, we propose another variant, (l, t)-GMRES, that not only performs fewer global all-reduces than standard GMRES but also overlaps those all-reduces with other work. We implemented the thread parallelism and communication overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67×. In addition, although we used only 50 nodes, our variant performed up to 1.22× better than s-GMRES when the latency cost became significant, by hiding the all-reduces.
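To illustrate the overlap idea behind the nonblocking-collective implementation, here is a minimal sketch (not the authors' code) of hiding a global all-reduce behind local work using MPI-3's MPI_Iallreduce. The vector length, the `local_spmv` stand-in, and all variable names are illustrative assumptions; the point is only that the dot-product reduction is started, the next local computation proceeds, and the result is awaited afterwards.

```c
/* Minimal sketch: overlapping a global all-reduce with local computation,
 * the core idea of pipelined GMRES. Requires MPI-3 nonblocking collectives.
 * N_LOCAL and local_spmv() are placeholders, not the paper's kernels. */
#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 1000  /* local vector length per rank (illustrative) */

/* Stand-in for the local part of a sparse matrix-vector product. */
static void local_spmv(const double *x, double *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = 2.0 * x[i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double v[N_LOCAL], w[N_LOCAL];
    for (int i = 0; i < N_LOCAL; ++i) v[i] = 1.0;

    /* Local contribution to the dot product <v, v>. */
    double local_dot = 0.0, global_dot = 0.0;
    for (int i = 0; i < N_LOCAL; ++i) local_dot += v[i] * v[i];

    /* Start the global reduction without blocking ... */
    MPI_Request req;
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... and overlap it with the next SpMV (the "pipelining" step). */
    local_spmv(v, w, N_LOCAL);

    /* Only now wait for the reduction result before it is needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("global <v,v> = %g\n", global_dot);

    MPI_Finalize();
    return 0;
}
```

A thread-parallel kernel or a task scheduler can play the role of `local_spmv` here; the design choice in either case is that useful work fills the latency of the collective rather than the iteration stalling on it.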
