Performance of the Parallel One-Sided Block Jacobi SVD Algorithm on a Modern Distributed-Memory Parallel Computer

The one-sided block Jacobi (OSBJ) method is known to be an efficient algorithm for computing the singular value decomposition (SVD). In this paper, we evaluate the performance of the most recent variant of the OSBJ method, the one with dynamic ordering and variable blocking, on the Fujitsu FX10 parallel computer. By analyzing the performance results, we identify two bottlenecks: the weight computation for ordering and the diagonalization of \(2\times 2\) block matrices. To remove these bottlenecks, we propose new implementations of the two tasks. Experimental results show that they are effective and achieve a total speedup of up to 1.6 times. As a result, our OSBJ solver can compute the SVD of matrices of order 2048 to 8192 on 12 to 48 nodes of the FX10 more than three times faster than ScaLAPACK PDGESVD.
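
For readers unfamiliar with the underlying iteration, the following is a minimal serial sketch of the scalar (non-block) one-sided Jacobi method, often attributed to Hestenes. It is not the solver evaluated in this paper: it uses a fixed cyclic ordering rather than dynamic ordering, has no blocking, and runs on a single process. The function name `one_sided_jacobi_svd` and the parameters `tol` and `max_sweeps` are assumptions introduced only for this illustration. Each step rotates a pair of columns so that they become mutually orthogonal, which is the scalar analogue of diagonalizing a \(2\times 2\) block matrix in the block method.

```python
import math


def one_sided_jacobi_svd(a, tol=1e-12, max_sweeps=60):
    """Scalar one-sided Jacobi SVD sketch for an m x n matrix (list of rows), m >= n.

    Returns (u, sigma, v) such that a is approximately u * diag(sigma) * v^T.
    The singular values in sigma are not sorted.
    """
    m, n = len(a), len(a[0])
    g = [row[:] for row in a]  # working copy; its columns converge to sigma_j * u_j
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        # Fixed cyclic ordering over all column pairs (an assumption of this
        # sketch; the paper's variant chooses pairs dynamically by weight).
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(g[i][p] * g[i][p] for i in range(m))
                beta = sum(g[i][q] * g[i][q] for i in range(m))
                gamma = sum(g[i][p] * g[i][q] for i in range(m))
                # Skip pairs that are already orthogonal to working precision.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of columns p and q.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same column rotation to g and to the accumulated v.
                for mat, rows in ((g, m), (v, n)):
                    for i in range(rows):
                        xp, xq = mat[i][p], mat[i][q]
                        mat[i][p] = c * xp - s * xq
                        mat[i][q] = s * xp + c * xq
        if not rotated:
            break  # every column pair is orthogonal: converged

    # Column norms are the singular values; normalized columns form u.
    sigma = [math.sqrt(sum(g[i][j] ** 2 for i in range(m))) for j in range(n)]
    u = [[g[i][j] / sigma[j] if sigma[j] > 0.0 else 0.0 for j in range(n)]
         for i in range(m)]
    return u, sigma, v
```

Calling `one_sided_jacobi_svd([[1.0, 2.0], [3.0, 4.0]])` returns factors whose product reconstructs the input up to rounding error. In a block variant, the columns `p` and `q` are replaced by column blocks and the rotation by an orthogonal transformation that diagonalizes the corresponding \(2\times 2\) block Gram matrix.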