We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix that improve performance over standard block algorithms. The recursion leads to an automatic variable blocking, and it also replaces the Level 2 part of a standard block algorithm with Level 3 operations. However, there are significant additional costs for creating and performing the updates, which prohibit the efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and by up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000® SP nodes: a 120-MHz IBM POWER2 node and one processor of a four-way 332-MHz IBM PowerPC® 604e SMP node. The hybrid recursive algorithm reaches more than 90% of the theoretical peak performance of the POWER2 node. Compared to standard block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation on a four-way 332-MHz IBM PowerPC 604e SMP node, based on dynamic load balancing, is presented. For two, three, and four processors it shows speedups of up to 1.97, 2.99, and 3.97, respectively.
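The recursive idea summarized above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `recursive_qr` and `hybrid_qr`, the compact WY form Q = I - Y T Y^T, the even column split, and the panel width `nb` are all assumptions chosen for illustration, and the code assumes m >= n. The recursion splits the columns in half, factors the left half, updates the right half with matrix-matrix products, and factors the remainder. The hybrid variant applies the recursion only to panels of width `nb`, which limits the extra cost of building and applying the updates when n is large.

```python
import numpy as np

def recursive_qr(A):
    """Recursive Householder QR of an m x n array (m >= n assumed).

    Returns (Y, T, R) with Q = I - Y @ T @ Y.T and A = Q[:, :n] @ R.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    if n == 1:
        # Base case: one Householder reflector H = I - tau * v v^T
        # with v[0] = 1 and H @ x = beta * e_1.
        x = A[:, 0].copy()
        alpha, norm = x[0], np.linalg.norm(x)
        if norm == 0.0:
            v = np.zeros(m)
            v[0] = 1.0
            tau, beta = 0.0, 0.0
        else:
            beta = -np.copysign(norm, alpha)
            v = x
            v[0] = alpha - beta
            v = v / v[0]
            tau = (beta - alpha) / beta
        return v.reshape(m, 1), np.array([[tau]]), np.array([[beta]])

    n1 = n // 2
    n2 = n - n1
    # Factor the left half of the columns.
    Y1, T1, R11 = recursive_qr(A[:, :n1])
    # Apply Q1^T to the right half using matrix-matrix products only.
    A2 = A[:, n1:] - Y1 @ (T1.T @ (Y1.T @ A[:, n1:]))
    R12 = A2[:n1, :]
    # Factor the updated lower-right part.
    Y2, T2, R22 = recursive_qr(A2[n1:, :])
    Y2 = np.vstack([np.zeros((n1, n2)), Y2])
    # Merge the two block reflectors: Q = Q1 @ Q2 = I - Y T Y^T.
    T12 = -T1 @ (Y1.T @ Y2) @ T2
    Y = np.hstack([Y1, Y2])
    T = np.block([[T1, T12], [np.zeros((n2, n1)), T2]])
    R = np.block([[R11, R12], [np.zeros((n2, n1)), R22]])
    return Y, T, R

def hybrid_qr(A, nb=32):
    """Blocked outer loop with a recursive panel factorization.

    Returns only the n x n upper triangular factor R (m >= n assumed).
    """
    A = np.array(A, dtype=float)  # work on a copy
    m, n = A.shape
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        Y, T, R = recursive_qr(A[j:, j:j + jb])
        A[j:j + jb, j:j + jb] = R
        A[j + jb:, j:j + jb] = 0.0
        # Update the trailing columns with the panel's block reflector.
        A[j:, j + jb:] -= Y @ (T.T @ (Y.T @ A[j:, j + jb:]))
    return A[:n, :].copy()
```

A short usage example: for a tall matrix `A`, `Y, T, R = recursive_qr(A)` gives the factors, and `Q = np.eye(A.shape[0]) - Y @ T @ Y.T` forms the orthogonal factor explicitly if it is needed. The sketch allocates new arrays at every level of the recursion for clarity, so it illustrates the structure of the algorithm rather than its performance.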