Paper information - Programming parallel dense matrix factorizations with look-ahead and OpenMP

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of the Basic Linear Algebra Subprograms (BLAS). The proposed approach also differs from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and by realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified at a high level of abstraction, the actual implementation can be easily derived from them, paving the way to a high-performance implementation of a considerable fraction of the Linear Algebra PACKage (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.
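To make the idea of static look-ahead concrete, the following is a minimal sketch in C with OpenMP, not the authors' implementation. It assumes an LU factorization without pivoting (safe only for matrices such as diagonally dominant ones), a row-major layout, and plain loops in place of the cache-aware multi-threaded BLAS that the abstract describes. The function names (`factor_panel`, `update_cols`, `lu_lookahead`, `lu_residual`) and the use of two OpenMP sections are assumptions made for illustration. In each iteration, one branch updates and factors the next panel while the other branch applies the current panel to the rest of the trailing matrix, so the panel factorization no longer sits on the critical path.

```c
#include <assert.h>
#include <stdlib.h>

/* Row-major access into the n x n matrix `a` (both names taken from scope). */
#define A(i, j) a[(size_t)(i) * n + (j)]

/* Unblocked LU without pivoting of one panel:
   columns [p, p+pb), rows [p, n). Assumes nonzero pivots. */
static void factor_panel(double *a, int n, int p, int pb) {
    for (int j = p; j < p + pb; ++j)
        for (int i = j + 1; i < n; ++i) {
            A(i, j) /= A(j, j);                   /* multiplier l_ij */
            for (int c = j + 1; c < p + pb; ++c)
                A(i, c) -= A(i, j) * A(j, c);     /* update inside the panel */
        }
}

/* Apply the factored panel starting at column k (width kb) to columns
   [c0, c1): a unit-lower triangular solve on rows [k, k+kb) fused with
   the Schur-complement update on the rows below. */
static void update_cols(double *a, int n, int k, int kb, int c0, int c1) {
    for (int j = k; j < k + kb; ++j)
        for (int i = j + 1; i < n; ++i)
            for (int c = c0; c < c1; ++c)
                A(i, c) -= A(i, j) * A(j, c);
}

/* Blocked right-looking LU with static look-ahead; `b` is the panel width.
   On return `a` holds L (unit diagonal, strictly lower part) and U. */
void lu_lookahead(double *a, int n, int b) {
    assert(a != NULL && n > 0 && b > 0);
    factor_panel(a, n, 0, b < n ? b : n);         /* first panel, up front */
    for (int k = 0; k < n; k += b) {
        int kb = (n - k < b) ? n - k : b;         /* width of current panel */
        int next = k + kb;                        /* first column of next panel */
        int nb = (n - next < b) ? n - next : b;   /* 0 in the last iteration */
        /* The two sections write disjoint column ranges and only read
           panel k, so they can run concurrently. */
        #pragma omp parallel sections
        {
            #pragma omp section
            {   /* look-ahead branch: update, then factor, the next panel */
                update_cols(a, n, k, kb, next, next + nb);
                factor_panel(a, n, next, nb);
            }
            #pragma omp section
            {   /* trailing update of all remaining columns */
                update_cols(a, n, k, kb, next + nb, n);
            }
        }
    }
}

/* Largest absolute entry of L*U - orig, for checking a factorization. */
double lu_residual(const double *orig, const double *lu, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            int m = i < j ? i : j;
            for (int t = 0; t <= m; ++t) {
                double l = (t == i) ? 1.0 : lu[(size_t)i * n + t];
                s += l * lu[(size_t)t * n + j];
            }
            double d = s - orig[(size_t)i * n + j];
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

A caller would copy the matrix, run `lu_lookahead(copy, n, b)`, and compare against the original with `lu_residual`. Compiled without OpenMP support the pragmas are ignored and the code runs sequentially with the same result. In a tuned version the trailing-update branch would typically receive most of the threads, since it carries most of the floating-point work.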
