Tiled QR factorization algorithms

This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p × q tiles, where p ≥ q. Within this framework, we study the critical paths and performance of algorithms such as SAMEH-KUCK, FIBONACCI, GREEDY, and those found within PLASMA. Although neither FIBONACCI nor GREEDY is optimal, both are shown to be asymptotically optimal for all matrices of size p = q^2 f(q), where f is any function such that lim_{q→+∞} f(q) = 0. This novel and important complexity result applies to all matrices where p and q are proportional, p = λq, with λ ≥ 1, thereby encompassing many important situations in practice (for example, least squares problems). We provide an extensive set of experiments that show the superiority of the new algorithms for tall matrices.
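
To see why the proportional case p = λq falls under the asymptotic optimality condition, one can check it directly. The short derivation below assumes the particular choice f(q) = λ/q, which is an illustration and not a formula quoted from the abstract:

```latex
\[
  p \;=\; \lambda q \;=\; q^{2}\cdot\frac{\lambda}{q} \;=\; q^{2} f(q),
  \qquad f(q) = \frac{\lambda}{q},
  \qquad \lim_{q \to +\infty} f(q) = 0 .
\]
```

Since λ is a fixed constant, this f tends to zero as q grows, so every matrix with p proportional to q satisfies the stated hypothesis.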
