QR factorization of a dense matrix on a shared-memory multiprocessor

A new algorithm for computing an orthogonal decomposition of a rectangular m × n matrix A on a shared-memory parallel computer is described. The algorithm uses Givens rotations and has a low synchronization cost. In particular, for a multiprocessor with p processors, an analysis of the algorithm shows that this cost is O(n²/p) if m/p ≥ n, and O(mn/p²) if m/p < n. In the latter case, mn/p² < n²/p, so the synchronization cost is smaller than O(n²/p). The synchronization cost of the proposed algorithm is therefore bounded by O(n²/p) when m ≥ n. This matters on machines where synchronization is expensive and when m ≫ n. Analysis and experiments show that the algorithm balances the load effectively and achieves high efficiency (speedup).
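
For readers unfamiliar with the building block, the following is a minimal sequential sketch of a QR factorization by Givens rotations. It is not the parallel algorithm described above: the function names `givens` and `givens_qr` and the elimination order (bottom-up within each column) are assumptions chosen for illustration. Each rotation touches only two rows, and rotations acting on disjoint row pairs are independent, which is the property that parallel Givens schemes generally rely on.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0  # nothing to eliminate, use the identity rotation
    r = math.hypot(a, b)
    return a / r, b / r


def givens_qr(A):
    """Factor an m x n matrix A (list of rows) as A = Q R using Givens rotations.

    Q is m x m orthogonal and R is m x n upper triangular.
    Illustrative, sequential sketch only.
    """
    m, n = len(A), len(A[0])
    R = [[float(x) for x in row] for row in A]
    # Qt accumulates the product of all rotations, i.e. Q transposed.
    Qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

    for j in range(n):
        # Zero the entries below the diagonal in column j, from the bottom up.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            # Apply the same rotation to rows i-1 and i of both R and Qt.
            for M in (R, Qt):
                top, bot = M[i - 1], M[i]
                M[i - 1] = [c * t + s * b for t, b in zip(top, bot)]
                M[i] = [-s * t + c * b for t, b in zip(top, bot)]

    Q = [list(col) for col in zip(*Qt)]  # transpose Qt to obtain Q
    return Q, R


if __name__ == "__main__":
    Q, R = givens_qr([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    for row in R:
        print(["%.4f" % x for x in row])
```

In this sketch, a rotation of rows (i-1, i) and a rotation of rows (k-1, k) with no row in common could be applied concurrently; how such rotations are grouped and scheduled across p processors is what determines the synchronization cost discussed above.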