Leveraging Task-Based Polar Decomposition Using PARSEC on Massively Parallel Systems

This paper describes a task-based implementation of the polar decomposition on massively parallel systems using the PaRSEC dynamic runtime system. Based on a formulation of the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, our implementation reduces data traffic while exploiting the high concurrency of the underlying hardware architecture. First, we replace the classical QR factorization, which is the most time-consuming phase, with a new hierarchical variant customized for the specific structure of the matrix during the QDWH iterations. The newly developed hierarchical QR for QDWH not only exploits the matrix structure but also shortens the critical path to maximize hardware occupancy. We then deploy PaRSEC to seamlessly orchestrate, pipeline, and track the data dependencies of the various linear algebra building blocks involved in the iterative QDWH algorithm. PaRSEC overlaps communication with computation through its asynchronous scheduling of fine-grained computational tasks. It employs look-ahead techniques to further expose parallelism while actively pursuing the critical path. In addition, we identify synergistic opportunities between the task-based QDWH algorithm and the PaRSEC framework, and we exploit them during the hierarchical QR factorization to enforce locality-aware task execution. This locality awareness minimizes expensive inter-node communication, which is one of the main bottlenecks for scaling up applications on challenging distributed-memory systems. We report numerical accuracy and performance results using well- and ill-conditioned matrices. The benchmarking campaign reveals up to a 2X speedup over the existing state-of-the-art implementation of the polar decomposition on 36,864 cores.
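For readers unfamiliar with QDWH, the following is a minimal sketch of the scalar map that the dynamically weighted Halley iteration applies to each singular value of the scaled matrix. It is an illustration only and is not the paper's implementation: it uses the standard QDWH weight formulas from the literature, and the function names (dwh_weights, dwh_iterate), the tolerance, and the iteration cap are assumptions made for this example. In the matrix algorithm, the same weights enter through a QR factorization of the stacked matrix [sqrt(c) X; I], whose identity block is the kind of structure a customized QR can exploit.

```python
import math


def dwh_weights(l):
    """Return the weights (a, b, c) for a lower bound l in (0, 1].

    l is a lower bound on the smallest singular value of the current,
    scaled iterate. These are the standard QDWH formulas; the function
    name is hypothetical.
    """
    l2 = l * l
    d = (4.0 * (1.0 - l2) / (l2 * l2)) ** (1.0 / 3.0)
    a = math.sqrt(1.0 + d) + 0.5 * math.sqrt(
        8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * math.sqrt(1.0 + d))
    )
    b = (a - 1.0) ** 2 / 4.0
    c = a + b - 1.0
    return a, b, c


def dwh_iterate(x, l, tol=1e-14, max_iter=20):
    """Apply x <- x (a + b x^2) / (1 + c x^2) until the bound l reaches 1.

    x plays the role of one singular value in [l, 1]. tol and max_iter
    are illustrative choices, not values taken from the paper.
    """
    steps = 0
    while abs(1.0 - l) > tol and steps < max_iter:
        a, b, c = dwh_weights(l)
        x = x * (a + b * x * x) / (1.0 + c * x * x)
        # The lower bound follows the same map; clamp it so rounding
        # cannot push it above 1 (which would make d negative).
        l = min(1.0, l * (a + b * l * l) / (1.0 + c * l * l))
        steps += 1
    return x, steps


if __name__ == "__main__":
    # A singular value of an ill-conditioned matrix is driven toward 1.
    value, steps = dwh_iterate(1e-8, 1e-8)
    print(value, steps)
```

When l equals 1 the weights reduce to (3, 1, 3), which is the classical Halley iteration. For small l the weights are much larger, which is what lets the iteration handle ill-conditioned inputs in few steps.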
