Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing

We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now ubiquitous non-uniform memory access NUMA high concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. Performance results on a large NUMA system outperform the state-of-the-art existing implementations up to a two fold speedup for the Cholesky factorization, as well as the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version.

[1]  Jack J. Dongarra,et al.  Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures , 2010, IEEE Transactions on Parallel and Distributed Systems.

[2]  Alejandro Duran,et al.  A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures , 2009, IWOMP.

[3]  Karl-Filip Faxén,et al.  Wool-A work stealing library , 2008, CARN.

[4]  Emmanuel Agullo,et al.  Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures , 2010, VECPAR.

[5]  Emmanuel Jeannot,et al.  Symbolic mapping and allocation for the Cholesky factorization on NUMA machines , 2013, Int. J. High Perform. Comput. Appl..

[6]  Jack J. Dongarra,et al.  Scheduling dense linear algebra operations on multicore processors , 2010, Concurr. Comput. Pract. Exp..

[7]  John Shalf,et al.  The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..

[8]  Alejandro Duran,et al.  A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks , 2009, International Journal of Parallel Programming.

[9]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[10]  Jack Dongarra,et al.  ScaLAPACK Users' Guide , 1987 .

[11]  Quan Chen,et al.  LAWS: locality-aware work-stealing for multi-socket multi-core architectures , 2014, ICS '14.

[12]  Vladimir Vlassov,et al.  Locality-Aware Task Scheduling and Data Distribution on NUMA Systems , 2013, IWOMP.

[13]  David E. Keyes,et al.  Exaflop/s: The why and the how , 2011 .

[14]  Guillaume Mercier,et al.  hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications , 2010, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing.

[15]  J. Dongarra,et al.  Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures LAPACK Working Note # 209 , 2008 .

[16]  Robert A. van de Geijn,et al.  The libflame Library for Dense Matrix Computations , 2009, Computing in Science & Engineering.

[17]  Jack J. Dongarra,et al.  Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization , 2008, IEEE Transactions on Parallel and Distributed Systems.

[18]  Nicholas J. Higham,et al.  INVERSE PROBLEMS NEWSLETTER , 1991 .

[19]  Jack Dongarra,et al.  QUARK Users' Guide: QUeueing And Runtime for Kernels , 2011 .

[20]  Lars Karlsson,et al.  Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures , 2011, Parallel Comput..

[21]  Karine Heydemann,et al.  Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages , 2014, ACM Trans. Archit. Code Optim..

[22]  Samuel Thibault,et al.  Structuring the execution of OpenMP applications for multicore architectures , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[23]  Julien Langou,et al.  A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures , 2007, Parallel Comput..

[24]  David E. Keyes,et al.  Pipelining Computational Stages of the Tomographic Reconstructor for Multi-Object Adaptive Optics on a Multi-GPU System , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[25]  Vivek Sarkar,et al.  Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement , 2009, LCPC.

[26]  Nicholas J. Higham,et al.  Accuracy and stability of numerical algorithms, Second Edition , 2002 .

[27]  David E. Keyes,et al.  Communication Complexity of the Fast Multipole Method and its Algebraic Variants , 2014, Supercomput. Front. Innov..

[28]  Robert A. van de Geijn,et al.  Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures , 2008, 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008).

[29]  Robert A. van de Geijn,et al.  Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures , 2007, SPAA '07.