A QDWH-based SVD Software Framework on Distributed-memory Manycore Systems

This article presents a high-performance software framework for computing a dense SVD on distributed-memory manycore systems. The SVD solver, originally introduced by Nakatsukasa et al. (2010) and Nakatsukasa and Higham (2013), relies on the polar decomposition computed with the QR-based Dynamically Weighted Halley (QDWH) algorithm. Although the QDWH-based SVD algorithm performs significantly more floating-point operations than the traditional SVD with one-stage bidiagonal reduction, the high level of concurrency inherent in its compute-bound Level 3 BLAS kernels ultimately compensates for the extra arithmetic. Using the ScaLAPACK two-dimensional block-cyclic data distribution with a rectangular processor topology instead of the default square processor grid, the resulting QDWH-SVD further reduces communication during the panel factorization while increasing the degree of parallelism during the update of the trailing submatrix. After detailing the algorithmic complexity and memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on performance by examining the trade-offs between communication and computation revealed by profiling. We report performance results against state-of-the-art QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and the Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to 3-fold and 8-fold speedups against ScaLAPACK PDGESVD on the Haswell- and KNL-based platforms, respectively, and proves to be a competitive alternative for both well- and ill-conditioned matrices. Finally, we derive a performance model from these empirical results.
Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1.
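
To make the algorithmic idea in the abstract concrete, the following is a minimal single-node NumPy sketch of a QDWH-based SVD. It is not the distributed framework described here and makes several simplifying assumptions: the input is a real, square, nonsingular matrix; the scaling factor and the lower bound on the smallest singular value are obtained from a Frobenius norm and an explicit inverse (a production code would use cheap norm and condition estimators instead); and the symmetric eigensolver is NumPy's `eigh`. The function names `qdwh_polar` and `qdwh_svd` are hypothetical and chosen for illustration only.

```python
import numpy as np


def qdwh_polar(A, tol=1e-10, max_iter=20):
    """Orthogonal polar factor U_p of a real square nonsingular A (A = U_p H),
    computed with the QR-based dynamically weighted Halley iteration."""
    n = A.shape[0]
    alpha = np.linalg.norm(A, "fro")      # upper bound on ||A||_2
    X = A / alpha
    # Lower bound on sigma_min(X): 1 / (sqrt(n) * ||X^{-1}||_1).
    # Assumption: an explicit inverse is acceptable for a small illustration.
    l = 1.0 / (np.sqrt(n) * np.linalg.norm(np.linalg.inv(X), 1))
    I = np.eye(n)
    for _ in range(max_iter):
        # Dynamically chosen weights (a, b, c) from the current bound l.
        l2 = l * l
        d = np.cbrt(4.0 * (1.0 - l2) / l2**2)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR factorization of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:n], Q[n:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        # Update the lower bound on the smallest singular value of X_new.
        l = min(1.0, l * (a + b * l2) / (1.0 + c * l2))
        converged = np.linalg.norm(X_new - X, "fro") <= tol
        X = X_new
        if converged:
            break
    return X


def qdwh_svd(A):
    """SVD A = U diag(s) V^T from the polar decomposition A = U_p H."""
    Up = qdwh_polar(A)
    H = Up.T @ A
    H = 0.5 * (H + H.T)                   # enforce symmetry against rounding
    w, V = np.linalg.eigh(H)              # H = V diag(w) V^T, w ascending
    order = np.argsort(w)[::-1]           # singular values in descending order
    s, V = w[order], V[:, order]
    return Up @ V, s, V.T


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
U, s, Vt = qdwh_svd(A)
print(np.linalg.norm((U * s) @ Vt - A))   # reconstruction error
```

Each iteration consists of one QR factorization of a 2n-by-n matrix and one matrix-matrix product, which is why the method maps well onto Level 3 BLAS kernels despite its higher operation count.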
