A Compiler-Blockable Algorithm for QR Decomposition

Because of an imbalance between computation and memory speed in modern processors, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs. This paper describes a block algorithm for QR decomposition that is derivable by the compiler and performs well on small matrices, the sizes typically run on a node of a massively parallel system or on a workstation. The advantage of our algorithm over the one found in LAPACK is that it can be derived by the compiler and needs no hand optimization.
