Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling

Implementing parallel software for QR factorizations to achieve scalable performance on massively parallel manycore systems requires a comprehensive design that includes algorithm redesign, efficie...

[18]  James Demmel,et al.  ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance , 1995, PARA.