Introduction of shared-memory parallelism in a distributed-memory multifrontal solver

We study the adaptation of a parallel distributed-memory solver towards a shared-memory code, targeting multi-core architectures. The advantage of adapting the code over a new design is to fully benefit from its numerical kernels, range of functionalities and internal features. Although the studied code is a direct solver for sparse systems of linear equations, the approaches described in this paper are general and could be useful to a wide range of applications. We show how existing parallel algorithms can be adapted to an OpenMP environment while, at the same time, also relying on third-party optimized multithreaded libraries. We propose simple approaches to take advantage of NUMA architectures, and original optimizations to limit thread synchronization costs. For each point, the performance gains are analyzed in detail on test problems from various application areas.

[1]  Pascal Hénon,et al.  PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions , 2000, IPDPS Workshops.

[2]  Gil Shklarski,et al.  Parallel unsymmetric-pattern multifrontal sparse LU with column preordering , 2008, TOMS.

[3]  Jennifer A. Scott,et al.  Design of a Multicore Sparse Cholesky Factorization Using DAGs , 2010, SIAM J. Sci. Comput..

[4]  Patrick Amestoy,et al.  Hybrid scheduling for the parallel solution of linear systems , 2006, Parallel Comput..

[5]  Joseph W. H. Liu,et al.  On the storage requirement in the out-of-core multifrontal method for sparse factorization , 1986, TOMS.

[6]  Alex Pothen,et al.  A Mapping Algorithm for Parallel Sparse Cholesky Factorization , 1993, SIAM J. Sci. Comput..

[7]  Anshul Gupta A Shared- and distributed-memory parallel general sparse direct solver , 2007, Applicable Algebra in Engineering, Communication and Computing.

[8]  O. Schenk,et al.  ON FAST FACTORIZATION PIVOTING METHODS FOR SPARSE SYMMETRI C INDEFINITE SYSTEMS , 2006 .

[9]  Xiaoye S. Li Evaluation of SuperLU on multicore architectures , 2008 .

[10]  Jean-Yves L'Excellent,et al.  Some Experiments and Issues to Exploit Multicore Parallelism in a Distributed-Memory Parallel Sparse Direct Solver , 2010 .

[11]  Thomas Hérault,et al.  DAGuE: A Generic Distributed DAG Engine for High Performance Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[12]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[13]  Timothy A. Davis,et al.  Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization , 2011, TOMS.

[14]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[15]  Patrick R. Amestoy,et al.  Multifrontal parallel distributed symmetric and unsymmetric solvers , 2000 .

[16]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[17]  Dror Irony,et al.  Parallel and fully recursive multifrontal sparse Cholesky , 2004, Future Gener. Comput. Syst..

[18]  Joseph W. H. Liu,et al.  The Multifrontal Method for Sparse Matrix Solution: Theory and Practice , 1992, SIAM Rev..

[19]  Guillaume Mercier,et al.  hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[20]  James Demmel,et al.  SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems , 2003, TOMS.

[21]  Al Geist,et al.  Task scheduling for parallel sparse Cholesky factorization , 1990, International Journal of Parallel Programming.