Massively Parallel Polar Decomposition on Distributed-memory Systems

We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building on the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea is to use the best rational approximation of the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate QDWH convergence. Based on the rational functions introduced by Zolotarev (ZOLO) in 1877, the new PD algorithm, ZOLO-PD, converges within two iterations even for ill-conditioned matrices, instead of the six iterations needed by QDWH. ZOLO-PD exploits the property that Zolotarev functions remain optimal when two of them are composed in an appropriate manner. The resulting algorithm has a convergence rate of up to 17, in contrast to the cubic convergence rate of QDWH. This comes at the price of higher arithmetic cost and a larger memory footprint. The extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We report performance on up to 102,400 cores on two supercomputers and show that, given a large number of processing units, ZOLO-PD outperforms QDWH by up to 2.3×, especially in situations where QDWH runs out of work, for instance in the strong-scaling mode of operation.
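For orientation, the QDWH baseline that ZOLO-PD is compared against can be sketched serially in NumPy. This is a minimal illustration under stated assumptions, not the distributed implementation described above: the function name `qdwh_polar`, the tolerance, the stopping test, and the use of an exact SVD to obtain the lower bound on the smallest singular value are all choices made for this example. It assumes a real matrix with full column rank.

```python
import numpy as np


def qdwh_polar(A, tol=1e-10, max_iter=20):
    """Serial QDWH sketch: returns (U, H) with A ~= U @ H.

    Assumes A is real, m x n with m >= n, and has full column rank.
    """
    m, n = A.shape
    alpha = np.linalg.norm(A, 2)          # largest singular value of A
    X = A / alpha
    # Lower bound on the smallest singular value of X. A production code
    # would use a cheap estimate; taking the exact value is an assumption
    # made to keep this sketch short.
    ell = min(np.linalg.svd(X, compute_uv=False)[-1], 1.0)

    for _ in range(max_iter):
        ell2 = ell * ell
        # Dynamically chosen weights (a, b, c) of the Halley iteration.
        d = np.cbrt(4.0 * (1.0 - ell2) / (ell2 * ell2))
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - ell2) / (ell2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # QR-based, inverse-free evaluation of
        # X (a I + b X^T X) (I + c X^T X)^{-1}.
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)

        # Propagate the singular-value lower bound through the same map.
        ell = min(ell * (a + b * ell2) / (1.0 + c * ell2), 1.0)
        converged = np.linalg.norm(X_new - X, "fro") <= tol
        X = X_new
        if converged:
            break

    H = X.T @ A
    H = 0.5 * (H + H.T)                   # enforce symmetry of the factor
    return X, H
```

Each iteration here is a single QR factorization followed by a matrix product, so the iterations run strictly one after another. ZOLO-PD, as described above, trades this for more arithmetic per iteration that can be carried out in parallel.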
