Programming parallel dense matrix factorizations with look-ahead and OpenMP

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which extracts concurrency solely from a multi-threaded implementation of the basic linear algebra subprograms (BLAS). The proposed approach also differs from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and by realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified at a high level of abstraction, the actual implementation can be easily derived from them, paving the way toward high-performance implementations of a considerable fraction of the linear algebra package (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.
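To make the idea of static look-ahead concrete, the following minimal sketch applies it to a blocked LU factorization. It is an illustration under several assumptions, not the implementation described above: pivoting is omitted (so the input is assumed to be, for example, strictly diagonally dominant), the matrix is stored row-major, plain loops stand in for the cache-aware multi-threaded BLAS, and all names (`factor_panel`, `update_cols`, `lu_lookahead`, `lookahead_demo_residual`) are hypothetical. In each iteration, one OpenMP section updates and factorizes the next panel while the other section applies the current panel to the remaining trailing columns; the two sections write disjoint column ranges.

```c
#include <assert.h>
#include <stdlib.h>

/* Row-major access to an n x n matrix (storage layout is an assumption). */
#define AT(A, n, i, j) (A)[(size_t)(i) * (size_t)(n) + (size_t)(j)]

static int imin(int a, int b) { return a < b ? a : b; }

/* Unblocked LU without pivoting of the panel: rows k..n-1, columns k..k+b-1. */
static void factor_panel(double *A, int n, int k, int b) {
    for (int j = k; j < k + b; ++j) {
        for (int i = j + 1; i < n; ++i) AT(A, n, i, j) /= AT(A, n, j, j);
        for (int i = j + 1; i < n; ++i)
            for (int c = j + 1; c < k + b; ++c)
                AT(A, n, i, c) -= AT(A, n, i, j) * AT(A, n, j, c);
    }
}

/* Apply the factorized panel k to columns c0..c1-1: a unit lower triangular
 * solve on the block row, then a rank-b update of the rows below it. */
static void update_cols(double *A, int n, int k, int b, int c0, int c1) {
    for (int j = k; j < k + b; ++j)
        for (int i = j + 1; i < k + b; ++i)
            for (int c = c0; c < c1; ++c)
                AT(A, n, i, c) -= AT(A, n, i, j) * AT(A, n, j, c);
    for (int i = k + b; i < n; ++i)
        for (int p = k; p < k + b; ++p)
            for (int c = c0; c < c1; ++c)
                AT(A, n, i, c) -= AT(A, n, i, p) * AT(A, n, p, c);
}

/* Blocked LU (no pivoting) with static look-ahead; L and U overwrite A. */
void lu_lookahead(double *A, int n, int b) {
    assert(n > 0 && b > 0);
    factor_panel(A, n, 0, imin(b, n));          /* first panel, up front */
    for (int k = 0; k + b < n; k += b) {
        int next = k + b;                       /* first column of next panel */
        int nb = imin(b, n - next);             /* width of next panel */
        #pragma omp parallel sections
        {
            #pragma omp section
            {   /* look-ahead branch: update the next panel, then factorize it */
                update_cols(A, n, k, b, next, next + nb);
                factor_panel(A, n, next, nb);
            }
            #pragma omp section
            {   /* remainder of the trailing update, disjoint columns */
                update_cols(A, n, k, b, next + nb, n);
            }
        }
    }
}

/* Factorize a diagonally dominant test matrix; return max |(L*U - A0)_ij|. */
double lookahead_demo_residual(int n, int b) {
    double *A0 = malloc((size_t)n * (size_t)n * sizeof *A0);
    double *A = malloc((size_t)n * (size_t)n * sizeof *A);
    assert(A0 && A);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            AT(A, n, i, j) = AT(A0, n, i, j) = (i == j) ? n + 1.0 : 1.0 / (1 + i + j);
    lu_lookahead(A, n, b);
    double worst = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p <= imin(i, j); ++p)   /* L has a unit diagonal */
                sum += (p == i ? 1.0 : AT(A, n, i, p)) * AT(A, n, p, j);
            double d = sum - AT(A0, n, i, j);
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    free(A0);
    free(A);
    return worst;
}
```

Calling `lookahead_demo_residual(7, 3)` factorizes a 7 x 7 matrix with block size 3 and returns the largest entry-wise deviation of L*U from the original matrix. Compiled without OpenMP support, the pragmas are ignored and the same code runs sequentially. In a realistic setting the second section would itself be a multi-threaded BLAS call, with the thread split between the two branches chosen to balance their costs.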
