Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes

The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.

[1]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[2]  Timothy A. Davis,et al.  Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization , 2011, TOMS.

[3]  Jack J. Dongarra,et al.  Autotuning GEMM Kernels for the Fermi GPU , 2012, IEEE Transactions on Parallel and Distributed Systems.

[4]  Jack J. Dongarra,et al.  An Improved Magma Gemm For Fermi Graphics Processing Units , 2010, Int. J. High Perform. Comput. Appl..

[5]  Jennifer A. Scott,et al.  Design of a Multicore Sparse Cholesky Factorization Using DAGs , 2010, SIAM J. Sci. Comput..

[6]  Thomas Hérault,et al.  Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[7]  Barry W. Peyton,et al.  Progress in Sparse Matrix Methods for Large Linear Systems On Vector Supercomputers , 1987 .

[8]  Xiaoye S. Li Evaluation of Sparse LU Factorization and Triangular Solution on Multicore Platforms , 2008, VECPAR.

[9]  Joseph W. H. Liu The role of elimination trees in sparse factorization , 1990 .

[10]  Pascal Hénon,et al.  On finding approximate supernodes for an efficient block-ILU(k , 2008, Parallel Comput..

[11]  James Demmel,et al.  A Supernodal Approach to Sparse Partial Pivoting , 1999, SIAM J. Matrix Anal. Appl..

[12]  Pierre Ramet,et al.  Dynamic scheduling for sparse direct solver on NUMA architectures , 2008 .

[13]  J. Roman,et al.  On finding approximate supernodes for an efficient ILU(k) factorization , 2006 .

[14]  Alfredo Buttari,et al.  Fine-Grained Multithreading for the Multifrontal QR Factorization of Sparse Matrices , 2013, SIAM J. Sci. Comput..

[15]  Pierre Ramet,et al.  Fine Grain Scheduling for Sparse Solver on Manycore Architectures , 2012 .

[16]  Thomas Hérault,et al.  DAGuE: A Generic Distributed DAG Engine for High Performance Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[17]  Julien Langou,et al.  The Impact of Multicore on Math Software , 2006, PARA.

[18]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[19]  Cleve Ashcraft,et al.  A Fan-In Algorithm for Distributed Sparse Numerical Factorization , 1990, SIAM J. Sci. Comput..

[20]  Ninghui Sun,et al.  Fast implementation of DGEMM on Fermi GPU , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[21]  Olaf Schenk,et al.  Solving unsymmetric sparse systems of linear equations with PARDISO , 2002, Future Gener. Comput. Syst..

[22]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[23]  Chenhan D. Yu,et al.  A CPU-GPU hybrid approach for the unsymmetric multifrontal method , 2011, Parallel Comput..

[24]  Robert A. van de Geijn,et al.  The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations , 2012, J. Parallel Distributed Comput..

[25]  Asim YarKhan,et al.  Dynamic Task Execution on Shared and Distributed Memory Architectures , 2012 .

[26]  Stanimire Tomov,et al.  One-sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators , 2012, ICCS.

[27]  Roger Grimes,et al.  Multifrontal Computations on GPUs and Their Multi-core Hosts , 2010, VECPAR.

[28]  Julien Langou,et al.  A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[29]  Anamitra R. Choudhury,et al.  Multifrontal Factorization of Sparse SPD Matrices on GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[30]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[31]  John K. Reid,et al.  The Multifrontal Solution of Indefinite Sparse Symmetric Linear , 1983, TOMS.

[32]  Emmanuel Jeannot,et al.  Compact DAG representation and its symbolic scheduling , 1999, J. Parallel Distributed Comput..