Managing the complexity of lookahead for LU factorization with pivoting

We describe parallel implementations of LU factorization with pivoting for multicore architectures. The implementations differ along two dimensions: (1) classical partial pivoting versus the recently proposed incremental pivoting, and (2) extracting parallelism only within the Basic Linear Algebra Subprograms (BLAS) versus building and scheduling a directed acyclic graph (DAG) of tasks. Performance comparisons are given on two systems.
