Paper Information - Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling

Implementing parallel software for QR factorizations to achieve scalable performance on massively parallel manycore systems requires a comprehensive design that includes algorithm redesign, efficie...
