Exploiting a Parametrized Task Graph Model for the Parallelization of a Sparse Direct Multifrontal Solver

The advent of multicore processors requires to reconsider the design of high performance computing libraries to embrace portable and effective techniques of parallel software engineering. One of the most promising approaches consists in abstracting an application as a directed acyclic graph (DAG) of tasks. While this approach has been popularized for shared memory environments by the OpenMP 4.0 standard where dependencies between tasks are automatically inferred, we investigate an alternative approach, capable of describing the DAG of task in a distributed setting, where task dependencies are explicitly encoded. So far this approach has been mostly used in the case of algorithms with a regular data access pattern and we show in this study that it can be efficiently applied to a higly irregular numerical algorithm such as a sparse multifrontal QR method. We present the resulting implementation and discuss the potential and limits of this approach in terms of productivity and effectiveness in comparison with more common parallelization techniques. Although at an early stage of development, preliminary results show the potential of the parallel programming model that we investigate in this work.

[1]  Julien Langou,et al.  A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[2]  Thomas Hérault,et al.  Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[3]  Xavier Lacoste,et al.  Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu duster systems. (Ordonnancement et optimisations mémoire pour un solveur creux par méthodes directes sur des machines hétérogènes) , 2015 .

[4]  Eduard Ayguadé,et al.  An Extension of the StarSs Programming Model for Platforms with Multiple GPUs , 2009, Euro-Par.

[5]  Anil V. Rao,et al.  GPOPS-II , 2014, ACM Trans. Math. Softw..

[6]  Jack Dongarra,et al.  A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs , 2012 .

[7]  George Bosilca,et al.  Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach , 2012 .

[8]  Robert Schreiber,et al.  A New Implementation of Sparse Gaussian Elimination , 1982, TOMS.

[9]  Emmanuel Agullo,et al.  Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems , 2016, ACM Trans. Math. Softw..

[10]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[11]  Robert A. van de Geijn,et al.  The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations , 2012, J. Parallel Distributed Comput..

[12]  Patrick Amestoy,et al.  Multifrontal QR Factorization in a Multiprocessor Environment , 1996, Numer. Linear Algebra Appl..

[13]  Victor Eijkhout,et al.  A Parallel Sparse Direct Solver via Hierarchical DAG Scheduling , 2014, ACM Trans. Math. Softw..

[14]  Michel Cosnard,et al.  Automatic task graph generation techniques , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.

[15]  John K. Reid,et al.  The Multifrontal Solution of Indefinite Sparse Symmetric Linear , 1983, TOMS.

[16]  Thomas Hérault,et al.  PaRSEC: Exploiting Heterogeneity to Enhance Scalability , 2013, Computing in Science & Engineering.

[17]  Jack Dongarra,et al.  Distibuted Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA , 2011 .

[18]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[19]  Alfredo Buttari,et al.  Fine-Grained Multithreading for the Multifrontal QR Factorization of Sparse Matrices , 2013, SIAM J. Sci. Comput..

[20]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[21]  Jesús Labarta,et al.  Parallelizing dense and banded linear algebra libraries using SMPSs , 2009, Concurr. Comput. Pract. Exp..

[22]  Thomas Hérault,et al.  DAGuE: A Generic Distributed DAG Engine for High Performance Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[23]  Emmanuel Agullo,et al.  Tile QR factorization with parallel panel processing for multicore architectures , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[24]  Timothy A. Davis,et al.  Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization , 2011, TOMS.