Scheduling a Parallel Sparse Direct Solver to Multiple GPUs

We present a sparse direct solver that uses multi-level task scheduling on a modern heterogeneous compute node consisting of a multi-core host processor and multiple GPU accelerators. The solver is based on the multifrontal method, which is characterized by exploiting dense subproblems (fronts) related by an assembly tree. Dynamic task allocation is critical to high performance because it accounts for the asymmetric performance of heterogeneous devices. Device-specific tasks are generated and adapted to the different devices over the course of the multifrontal factorization using multi-level matrix partitioning: large blocks provide coarse-grained tasks for fast devices, and a subset of the blocks is recursively partitioned to supply fine-grained tasks for the next available (slower) devices. Experimental results are obtained on problems arising from a high-order Finite Element Method.
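As a rough illustration of the multi-level partitioning idea described above, the following minimal C++ sketch splits a dense front into coarse blocks and recursively refines a fraction of them, so that fast devices (GPUs) receive coarse-grained tasks while slower devices (CPU cores) receive fine-grained ones. All names (`Block`, `Task`, `partition_front`) and the 2x2 refinement scheme are hypothetical and are not taken from the solver's actual implementation.

```cpp
// Hypothetical sketch of multi-level block partitioning for heterogeneous
// task generation; names and granularity choices are illustrative only.
#include <cstdio>
#include <vector>

struct Block {            // a square sub-block of a dense front
    int row, col, size;   // offset and dimension within the front
};

struct Task {
    Block block;
    int   level;          // 0 = coarse (GPU-sized), >0 = refined (CPU-sized)
};

// Split one coarse block into 2x2 finer tiles for slower devices.
static void refine(const Block& b, int level, std::vector<Task>& out) {
    int half = b.size / 2;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            out.push_back({{b.row + i * half, b.col + j * half, half}, level});
}

// Partition an n x n front into coarse blocks; refine a given percentage of
// them so that both GPUs (coarse tasks) and CPU cores (fine tasks) stay busy.
static std::vector<Task> partition_front(int n, int coarse, int refined_pct) {
    std::vector<Task> tasks;
    int nb = (n + coarse - 1) / coarse;
    int count = 0, total = nb * nb;
    for (int bi = 0; bi < nb; ++bi) {
        for (int bj = 0; bj < nb; ++bj, ++count) {
            Block b{bi * coarse, bj * coarse, coarse};
            if (100 * count < refined_pct * total)
                refine(b, 1, tasks);        // fine-grained tasks for CPU cores
            else
                tasks.push_back({b, 0});    // coarse-grained tasks for GPUs
        }
    }
    return tasks;
}

int main() {
    // Example: a 4096 x 4096 front, 1024-wide coarse blocks, 25% refined.
    for (const Task& t : partition_front(4096, 1024, 25))
        std::printf("level %d block at (%d,%d) size %d\n",
                    t.level, t.block.row, t.block.col, t.block.size);
    return 0;
}
```

In practice the refinement ratio would be chosen dynamically from observed device throughput rather than fixed up front, which is the role of the dynamic task allocation described in the abstract.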
