Task-Based Sparse Hybrid Linear Solver for Distributed Memory Heterogeneous Architectures

Heterogeneity is emerging as one of the most challenging characteristics of today’s parallel environments. However, few fully featured, advanced numerical scientific libraries have been ported to such architectures. In this paper, we propose to extend a sparse hybrid solver to handle distributed memory heterogeneous platforms. As in the original solver, we perform a domain decomposition and associate one subdomain with one MPI process. However, while each subdomain was processed sequentially (bound to a single CPU core) in the original solver, the new solver instead relies on task-based local solvers, delegating tasks to the available computing units. We show that this “MPI+task” design conveniently allows for exploiting distributed memory heterogeneous machines. Indeed, a subdomain can now be processed on multiple CPU cores (such as a whole multicore processor or a subset of the available cores), possibly enhanced with GPUs. We illustrate our discussion with the MaPHyS sparse hybrid solver relying on the PaStiX and Chameleon sparse and dense direct libraries, respectively. Interestingly, this two-level MPI+task design also provides extra flexibility for controlling the number of subdomains, which enhances the numerical stability of the considered hybrid method. While the rise of heterogeneous computing has been strongly driven by the theoretical community, this study aims to show that it is now also possible to build complex software layers on top of runtime systems to exploit heterogeneous architectures.
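To make the structure of a sparse hybrid (direct/iterative) method concrete, the following is a minimal sketch on a toy problem, not the MaPHyS implementation. It assumes a 1D Poisson matrix tridiag(-1, 2, -1) split into `p` subdomains of `m` interior unknowns separated by single interface unknowns. Each subdomain is solved directly, and the interface (Schur complement) system is solved iteratively with conjugate gradient. All function names are hypothetical, a Python thread pool stands in loosely for concurrent subdomain processing, and no MPI or task runtime is involved.

```python
from concurrent.futures import ThreadPoolExecutor


def thomas_solve(rhs):
    """Direct solve of tridiag(-1, 2, -1) x = rhs (stand-in for a local direct solver)."""
    n = len(rhs)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = -0.5
    d[0] = rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def local_solve(f_local, left=0.0, right=0.0):
    """Interior solve for one subdomain, given its two neighbouring interface values."""
    rhs = list(f_local)
    rhs[0] += left    # coupling to the left interface unknown
    rhs[-1] += right  # coupling to the right interface unknown
    return thomas_solve(rhs)


def hybrid_solve(f, p, m, workers=4, tol=1e-10, max_iter=100):
    """Solve tridiag(-1, 2, -1) x = f with p subdomains of m interior unknowns each."""
    f_int = [f[i * (m + 1): i * (m + 1) + m] for i in range(p)]
    f_gam = [f[i * (m + 1) + m] for i in range(p - 1)]
    zeros = [0.0] * m
    with ThreadPoolExecutor(max_workers=workers) as pool:

        def run_local(rhs_list, v):
            # One concurrent job per subdomain; ext pads v with the zero boundary values.
            ext = [0.0] + list(v) + [0.0]
            jobs = [pool.submit(local_solve, rhs_list[i], ext[i], ext[i + 1])
                    for i in range(p)]
            return [job.result() for job in jobs]

        def schur_apply(v):
            # Implicit Schur complement product: only local solves are needed.
            u = run_local([zeros] * p, v)
            return [2.0 * v[j] - u[j][-1] - u[j + 1][0] for j in range(p - 1)]

        # Condensed right-hand side on the interface.
        t = run_local(f_int, [0.0] * (p - 1))
        g = [f_gam[j] + t[j][-1] + t[j + 1][0] for j in range(p - 1)]

        # Conjugate gradient on the interface system S x = g.
        x = [0.0] * (p - 1)
        r = list(g)
        d = list(r)
        rr = sum(ri * ri for ri in r)
        stop = tol * tol * rr
        for _ in range(max_iter):
            if rr <= stop:
                break
            sd = schur_apply(d)
            alpha = rr / sum(di * si for di, si in zip(d, sd))
            x = [xi + alpha * di for xi, di in zip(x, d)]
            r = [ri - alpha * si for ri, si in zip(r, sd)]
            rr_new = sum(ri * ri for ri in r)
            d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
            rr = rr_new

        # Back-substitute the interface values into each subdomain.
        interiors = run_local(f_int, x)

    out = []
    for i in range(p):
        out.extend(interiors[i])
        if i < p - 1:
            out.append(x[i])
    return out
```

Calling `hybrid_solve([1.0] * 23, p=4, m=5)` should agree with `thomas_solve([1.0] * 23)` applied to the undecomposed system. In this toy setting, raising `p` enlarges the interface system handled by the iterative solver, while lowering it shifts work to the local direct solves, which is the trade-off that control over the number of subdomains exposes.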
