Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives
Jack J. Dongarra | Piotr Luszczek | Ichitaro Yamazaki | Mark Hoemmen
[1] L. Reichel, et al. A Newton basis GMRES implementation, 1994.
[2] James Demmel, et al. Communication lower bounds and optimal algorithms for numerical linear algebra, 2014, Acta Numerica.
[3] Mark Hoemmen, et al. Communication-avoiding Krylov subspace methods, 2010.
[4] Matthew G. Knepley, et al. A stochastic performance model for pipelined Krylov methods, 2016, Concurr. Comput. Pract. Exp.
[5] Cédric Augonnet, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, 2011, Concurr. Comput. Pract. Exp.
[6] Sivasankaran Rajamanickam, et al. Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] John Van Rosendale. Minimizing Inner Product Data Dependencies in Conjugate Gradient Iteration, 1983, ICPP.
[8] Jesús Labarta, et al. CellSs: Making it easier to program the Cell Broadband Engine processor, 2007, IBM J. Res. Dev.
[9] Y. Saad, et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.
[10] Kesheng Wu, et al. A Block Orthogonalization Procedure with Constant Synchronization Requirements, 2000, SIAM J. Sci. Comput.
[11] James Demmel, et al. Communication-optimal Parallel and Sequential QR and LU Factorizations, 2008, SIAM J. Sci. Comput.
[12] Jack J. Dongarra, et al. Improving the Performance of CA-GMRES on Multicores with Multiple GPUs, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[13] G. W. Stewart. Block Gram–Schmidt Orthogonalization, 2008, SIAM J. Sci. Comput.
[14] Laura Grigori, et al. Communication Avoiding ILU0 Preconditioner, 2015, SIAM J. Sci. Comput.
[15] Alejandro Duran, et al. OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures, 2011, Parallel Process. Lett.
[16] Martin Tillenius, et al. SuperGlue: A Shared Memory Framework Using Data Versioning for Dependency-Aware Task-Based Parallelization, 2015, SIAM J. Sci. Comput.
[17] Asim YarKhan, et al. Dynamic Task Execution on Shared and Distributed Memory Architectures, 2012.
[18] James Demmel, et al. Minimizing communication in sparse matrix solvers, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[19] Wim Vanroose, et al. Hiding Global Communication Latency in the GMRES Algorithm on Massively Parallel Machines, 2013, SIAM J. Sci. Comput.
[20] Jesús Labarta, et al. Parallelizing dense and banded linear algebra libraries using SMPSs, 2009, Concurr. Comput. Pract. Exp.
[21] Jack Dongarra, et al. QUARK Users' Guide: QUeueing And Runtime for Kernels, 2011.
[22] Kesheng Wu, et al. A Communication-Avoiding Thick-Restart Lanczos Method on a Distributed-Memory System, 2011, Euro-Par Workshops.