A New High Performance and Scalable SVD Algorithm on Distributed Memory Systems

This paper introduces a high-performance implementation of the \texttt{Zolo-SVD} algorithm on distributed memory systems. It is based on the polar decomposition (PD) algorithm via Zolotarev's function (\texttt{Zolo-PD}), originally proposed by Nakatsukasa and Freund [SIAM Review, 2016]. Our implementation relies heavily on ScaLAPACK routines and is therefore portable. Compared with other PD algorithms such as the QR-based dynamically weighted Halley method (\texttt{QDWH-PD}), \texttt{Zolo-PD} is naturally parallelizable and scales better, although it performs more floating-point operations. When using many processes, \texttt{Zolo-PD} is usually 1.20 times faster than \texttt{QDWH-PD}, and \texttt{Zolo-SVD} can be about two times faster than the ScaLAPACK routine \texttt{PDGESVD}. The numerical experiments were performed on the Tianhe-2 supercomputer, one of the fastest supercomputers in the world, and the test matrices include sparse matrices from particular applications and randomly generated dense matrices of different dimensions. Our \texttt{QDWH-SVD} and \texttt{Zolo-SVD} implementations are freely available at this https URL.
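
The way an SVD follows from a polar decomposition can be illustrated with a small single-node sketch. If $A = U_p H$ with $U_p$ orthogonal and $H$ symmetric positive definite, and $H = V \Sigma V^T$ is an eigendecomposition, then $A = (U_p V) \Sigma V^T$ is an SVD. The code below is not the \texttt{Zolo-PD} or \texttt{QDWH-PD} iteration and does not use ScaLAPACK: as an assumption made only for illustration, it computes the polar factor with the plain Newton iteration $X_{k+1} = (X_k + X_k^{-T})/2$ for a square nonsingular matrix, and the function names `polar_newton` and `svd_via_polar` are hypothetical.

```python
import numpy as np


def polar_newton(a, tol=1e-12, max_iter=100):
    """Orthogonal polar factor of a square nonsingular matrix.

    Uses the unscaled Newton iteration X <- (X + X^{-T}) / 2.
    This is a simple stand-in for Zolo-PD / QDWH-PD, chosen for brevity.
    """
    x = a.copy()
    for _ in range(max_iter):
        x_next = 0.5 * (x + np.linalg.inv(x).T)
        # Stop once successive iterates agree to a relative tolerance.
        if np.linalg.norm(x_next - x, "fro") <= tol * np.linalg.norm(x_next, "fro"):
            return x_next
        x = x_next
    return x


def svd_via_polar(a):
    """Return (u, sigma, vt) with a ~= u @ diag(sigma) @ vt."""
    u_p = polar_newton(a)
    h = u_p.T @ a                 # symmetric positive definite factor H
    h = 0.5 * (h + h.T)           # remove rounding-level asymmetry
    w, v = np.linalg.eigh(h)      # eigenvalues returned in ascending order
    order = np.argsort(w)[::-1]   # reorder to the usual descending convention
    sigma = w[order]
    v = v[:, order]
    u = u_p @ v                   # left singular vectors
    return u, sigma, v.T


# Small random test matrix (assumed nonsingular, which holds with probability 1).
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
u, s, vt = svd_via_polar(a)
```

In this sketch the cost is dominated by the polar factor and the symmetric eigendecomposition, which mirrors the two stages named in the abstract; a distributed implementation would replace both with parallel kernels.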
