A Block QR Factorization Scheme for Loosely Coupled Systems of Array Processors

A statically scheduled parallel block QR factorization procedure is described. It is based on "block" Givens rotations and is modeled after the Gentleman-Kung systolic QR procedure. Independent tasks are associated with each block column, and the "tallest possible" subproblems are always solved. The method has been implemented on the IBM Kingston LCAP-1 system, which consists of ten FPS-164/MAX array processors that communicate through a large shared bulk memory. The implementation revealed much about the tradeoff between block size and load balancing: large blocks make load balancing more difficult, but they give better FPS-164/MAX performance and less shared memory traffic. The results indicate that our approach to parallelizing the QR factorization is competitive for very large problems, e.g., on the order of 5000-by-1000.
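
To make the idea of a "block" Givens rotation concrete, the following is a minimal sequential sketch in Python with NumPy. It is an illustration under stated assumptions, not the LCAP-1 implementation: the function names `rotate_block_rows` and `block_givens_qr` and the block-size parameter `b` are hypothetical, the matrix is assumed to have at least as many rows as columns, and no scheduling, task assignment, or shared-memory traffic is modeled. Each block rotation is realized here by a full QR factorization of two stacked blocks, which is one common way to build such a transformation and may differ from the construction used in the paper.

```python
import numpy as np

def rotate_block_rows(R, rows, k, kb):
    """Apply an orthogonal transformation to the given rows of R (in place).

    The transformation is chosen so that, within columns k:k+kb, the
    selected rows become upper triangular on top and zero below.
    """
    panel = R[rows, k:k + kb]
    # mode="complete" returns a square orthogonal Q for the stacked panel.
    Q, _ = np.linalg.qr(panel, mode="complete")
    # Update the panel and all trailing columns of the selected rows.
    R[rows, k:] = Q.T @ R[rows, k:]

def block_givens_qr(A, b):
    """Reduce A (m-by-n, m >= n) to upper triangular form with block rotations.

    Hypothetical helper: returns the m-by-n array R such that Q^T A = R for
    some orthogonal Q (Q itself is not accumulated in this sketch).
    """
    R = np.array(A, dtype=float)
    m, n = R.shape
    if m < n:
        raise ValueError("this sketch assumes m >= n")
    for k in range(0, n, b):                 # loop over block columns
        kb = min(b, n - k)                   # width of this block column
        pivot = np.arange(k, k + kb)         # rows of the diagonal block
        # Triangularize the diagonal block on its own first, so the result
        # is correct even when no blocks lie below it.
        rotate_block_rows(R, pivot, k, kb)
        # Eliminate each block below the diagonal block, one at a time.
        for i in range(k + kb, m, b):
            lower = np.arange(i, min(i + b, m))
            rotate_block_rows(R, np.concatenate([pivot, lower]), k, kb)
    return R
```

For example, `block_givens_qr(A, 4)` applied to a 50-by-12 matrix `A` should return an array whose entries below the diagonal are zero up to rounding error. In this sketch a larger `b` means fewer, larger dense operations per rotation, which loosely mirrors the block-size tradeoff described above.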