A balanced submatrix merging algorithm for multiprocessor architectures

In this article we describe a parallel algorithm that applies Givens rotations to selectively annihilate k(k + 1)/2 nonzero elements from two k × n (k ≤ n) upper trapezoidal submatrices. The proposed algorithm is suitable for implementation on either a pair of directly connected local-memory processors or two clusters of multiple tightly coupled processors. Our analyses show that, in both cases, the algorithm achieves optimal speed-up when k ≪ n by balancing the workload and masking inter-processor or inter-cluster communication with computation. In the context of solving large-scale least squares problems [1,4], this submatrix merging step is needed repeatedly throughout the computation. Furthermore, there are usually many pairs of such submatrices to be merged, with each submatrix stored in the memory of a processor or a cluster of processors. The proposed algorithm can be applied to each pair of submatrices concurrently and thus parallelizes an important step in solving least squares problems.
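
To make the merging step concrete, the following is a minimal serial sketch in Python, not the balanced parallel algorithm described in this article. It assumes that the k(k + 1)/2 elements to be annihilated are those in the leading k × k triangle of the second submatrix, and that row j of the first submatrix serves as the pivot row for column j. The names `givens` and `merge_trapezoidal` are hypothetical and chosen only for illustration.

```python
import math


def givens(a, b):
    """Return (c, s) with c*a + s*b = r and -s*a + c*b = 0."""
    if b == 0.0:
        return 1.0, 0.0  # nothing to annihilate: identity rotation
    r = math.hypot(a, b)
    return a / r, b / r


def merge_trapezoidal(R1, R2):
    """Serial sketch: zero the leading k x k triangle of R2 against R1.

    R1 and R2 are k x n upper trapezoidal matrices (lists of rows, k <= n).
    Assumption: row j of R1 is the pivot row for column j. Returns the
    updated copies of R1 and R2 and the number of elements targeted.
    """
    k, n = len(R1), len(R1[0])
    R1 = [row[:] for row in R1]  # work on copies
    R2 = [row[:] for row in R2]
    count = 0
    for j in range(k):            # column being cleared in R2
        for i in range(j + 1):    # rows of R2 that are nonzero in column j
            c, s = givens(R1[j][j], R2[i][j])
            # Both rows are already zero to the left of column j,
            # so the rotation only needs to touch columns j..n-1.
            for col in range(j, n):
                p, q = R1[j][col], R2[i][col]
                R1[j][col] = c * p + s * q
                R2[i][col] = -s * p + c * q
            R2[i][j] = 0.0        # remove rounding residue
            count += 1
    return R1, R2, count


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0, 4.0], [0.0, 5.0, 6.0, 7.0]]
    B = [[2.0, 1.0, 0.0, 3.0], [0.0, 4.0, 1.0, 2.0]]
    A_new, B_new, count = merge_trapezoidal(A, B)
    print(count, A_new, B_new)
```

In this sketch every rotation touches one row from each submatrix, so if the two submatrices live on different processors, each rotation implies an exchange of row data. That is the communication the proposed algorithm aims to mask with computation.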