A 3D Parallel Algorithm for QR Decomposition

Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.

[1]  Robert A. van de Geijn,et al.  Collective communication: theory, practice, and experience: Research Articles , 2007 .

[2]  Rajeev Thakur,et al.  Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[3]  Torsten Hoefler,et al.  Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[4]  James Demmel,et al.  Communication-optimal Parallel and Sequential QR and LU Factorizations , 2008, SIAM J. Sci. Comput..

[5]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[6]  Alston S. Householder,et al.  Unitary Triangularization of a Nonsymmetric Matrix , 1958, JACM.

[7]  James Demmel,et al.  Reconstructing Householder Vectors from Tall-Skinny QR , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[8]  Oded Schwartz,et al.  Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication , 2016, TOPC.

[9]  James Demmel,et al.  A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem , 2016, SPAA.

[10]  Robert A. van de Geijn,et al.  Collective communication: theory, practice, and experience , 2007, Concurr. Comput. Pract. Exp..

[11]  James Demmel,et al.  Trade-Offs Between Synchronization, Communication, and Computation in Parallel Linear Algebra Computations , 2016 .

[12]  James Demmel,et al.  Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[13]  Ramesh C. Agarwal,et al.  A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..

[14]  Martin D. Schatz,et al.  Parallel Matrix Multiplication: A Systematic Journey , 2016, SIAM J. Sci. Comput..

[15]  Erik Elmroth,et al.  Applying recursion to serial and parallel QR factorization leads to better performance , 2000, IBM J. Res. Dev..

[16]  Alexander Tiskin Communication-efficient parallel generic pairwise elimination , 2007, Future Gener. Comput. Syst..

[17]  David A. Bader,et al.  Parallel algorithms for personalized communication and sorting with an experimental study (extended abstract) , 1996, SPAA '96.

[18]  C. Puglisi Modification of the householder method based on the compact WY representation , 1992 .

[19]  David F. Gleich,et al.  Tall and skinny QR factorizations in MapReduce architectures , 2011, MapReduce '11.

[20]  James Demmel,et al.  Communication lower bounds and optimal algorithms for numerical linear algebra*† , 2014, Acta Numerica.

[21]  Christos H. Papadimitriou,et al.  A Communication-Time Tradeoff , 1987, SIAM J. Comput..

[22]  C. Loan,et al.  A Storage-Efficient $WY$ Representation for Products of Householder Transformations , 1989 .

[23]  Jack J. Dongarra,et al.  Improving the Performance of CA-GMRES on Multicores with Multiple GPUs , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[24]  Christian H. Bischof,et al.  A Basis-Kernel Representation of Orthogonal Matrices , 1995, SIAM J. Matrix Anal. Appl..

[25]  Erik Elmroth,et al.  SIAM REVIEW c ○ 2004 Society for Industrial and Applied Mathematics Vol. 46, No. 1, pp. 3–45 Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software ∗ , 2022 .

[26]  Roland W. Freund,et al.  Computing Fundamental Matrix Decompositions Accurately via the Matrix Sign Function in Two Iterations: The Power of Zolotarev's Functions , 2016, SIAM Rev..

[27]  Eli Upfal,et al.  Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems , 1997, IEEE Trans. Parallel Distributed Syst..