Algorithm 979

To exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as the Linear Algebra Package (LAPACK) commonly implement operations as blocked algorithms. However, to achieve near-optimal performance with such algorithms, significant tuning is required. In contrast, recursive algorithms are virtually tuning free and attain similar performance. In this article, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce Recursive LAPACK (ReLAPACK), an open-source library of recursive algorithms that seamlessly replaces many of LAPACK's blocked algorithms. In most scenarios, ReLAPACK outperforms reference LAPACK, and in many situations it improves upon the performance of optimized libraries.
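The core idea behind such recursive algorithms is that splitting a matrix in half and recursing yields automatic, size-adaptive blocking: each recursion level casts most of the work as large Level-3 BLAS operations, with no block-size parameter to tune. The following is a minimal sketch of this idea for the Cholesky factorization; it is not ReLAPACK's actual implementation, and the function name, the `cutoff` parameter, and the use of NumPy in place of BLAS/LAPACK kernels are illustrative assumptions.

```python
import numpy as np

def rec_cholesky(A, cutoff=8):
    """Sketch of a recursive Cholesky factorization (lower-triangular factor).

    Recursively splits A into quadrants; the off-diagonal and trailing
    updates are the large Level-3 operations (TRSM and SYRK in BLAS terms)
    that dominate the work. Illustrative only, not ReLAPACK code.
    """
    n = A.shape[0]
    if n <= cutoff:
        # Base case: defer to an unblocked/library kernel.
        return np.linalg.cholesky(A)
    k = n // 2
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]
    L11 = rec_cholesky(A11, cutoff)
    # L21 = A21 * L11^{-T}: a triangular solve (TRSM-like operation).
    L21 = np.linalg.solve(L11, A21.T).T
    # Trailing update A22 - L21 * L21^T (SYRK-like), then recurse on it.
    L22 = rec_cholesky(A22 - L21 @ L21.T, cutoff)
    L = np.zeros_like(A)
    L[:k, :k] = L11
    L[k:, :k] = L21
    L[k:, k:] = L22
    return L
```

Note that the only tuning knob, `cutoff`, merely decides when recursion stops; unlike the block size of a blocked algorithm, its value has little effect on performance because the flop-heavy updates grow with the problem size at every level.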

[1]  Robert A. van de Geijn,et al.  Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures , 2007, SPAA '07.

[2]  R. C. Whaley,et al.  Empirically tuning LAPACK’s blocking factor for increased performance , 2008, 2008 International Multiconference on Computer Science and Information Technology.

[3]  Jack Dongarra,et al.  LAPACK Working Note 19: Evaluating Block Algorithm Variants in LAPACK , 1990 .

[4]  Stefan Dessloch Euro-Par 2003 Parallel Processing: 9th International Euro-Par Conference Klagenfurt, Austria, August 26-29, 2003 Proceedings , 2003, Lecture Notes in Computer Science.

[5]  Isak Jonsson,et al.  RECSY - A High Performance Library for Sylvester-Type Matrix Equations , 2003, Euro-Par.

[6]  Jesús Labarta,et al.  Parallelizing dense and banded linear algebra libraries using SMPSs , 2009, Concurr. Comput. Pract. Exp..

[7]  Jack Dongarra,et al.  QUARK Users' Guide: QUeueing And Runtime for Kernels , 2011 .

[8]  Thomas Hérault,et al.  DAGuE: A Generic Distributed DAG Engine for High Performance Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[9]  Jack J. Dongarra,et al.  Anatomy of a globally recursive embedded LINPACK benchmark , 2012, 2012 IEEE Conference on High Performance Extreme Computing.

[10]  Jack Dongarra,et al.  Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects , 2009 .

[11]  Para,et al.  Applied Parallel Computing Large Scale Scientific and Industrial Problems , 1998, Lecture Notes in Computer Science.

[12]  Fred G. Gustavson,et al.  A recursive formulation of Cholesky factorization of a matrix in packed storage , 2001, TOMS.

[13]  Verdi March,et al.  Data mining analysis to validate performance tuning practices for HPL , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[14]  Fred G. Gustavson,et al.  Recursion leads to automatic variable blocking for dense linear-algebra algorithms , 1997, IBM J. Res. Dev..

[15]  Jerzy Wasniewski,et al.  Recursive Version of LU Decomposition , 2000, NAA.

[16]  Jeremy D. Frens,et al.  QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism , 2003, PPoPP '03.

[17]  Erik Elmroth,et al.  Applying recursion to serial and parallel QR factorization leads to better performance , 2000, IBM J. Res. Dev..

[18]  Jack J. Dongarra,et al.  Reducing the Amount of Pivoting in Symmetric Indefinite Systems , 2011, PPAM.

[19]  Paolo Bientinesi,et al.  Knowledge-Based Automatic Generation of Partitioned Matrix Expressions , 2011, CASC.

[20]  Sivan Toledo Locality of Reference in LU Decomposition with Partial Pivoting , 1997, SIAM J. Matrix Anal. Appl..

[21]  Robert A. van de Geijn,et al.  Families of algorithms related to the inversion of a Symmetric Positive Definite matrix , 2008, TOMS.

[22]  Isak Jonsson,et al.  Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms , 1998, PARA.

[23]  Charles E. Leiserson,et al.  Cache-Oblivious Algorithms , 2003, CIAC.

[24]  Hector Zenil,et al.  Applied Parallel Computing , 2004, IEEE Distributed Syst. Online.

[25]  Jack J. Dongarra,et al.  Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs , 1990, TOMS.

[26]  Jack J. Dongarra,et al.  Evaluating Block Algorithm Variants in LAPACK , 1989, PPSC.

[27]  N. Higham,et al.  Stability of methods for matrix inversion , 1992 .

[28]  Jeremy Du Croz,et al.  Factorizations of Band Matrices Using Level 3 BLAS , 1990, CONPAR.

[29]  Isak Jonsson,et al.  Recursive blocked algorithms for solving triangular systems—Part II: two-sided and generalized Sylvester and Lyapunov matrix equations , 2002, TOMS.

[30]  James Demmel,et al.  Communication-Avoiding Symmetric-Indefinite Factorization , 2014, SIAM J. Matrix Anal. Appl..

[31]  Fred G. Gustavson,et al.  LAWRA: Linear Algebra with Recursive Algorithms , 2000, PARA.

[32]  Tor Sørevik,et al.  Applied Parallel Computing. New Paradigms for HPC in Industry and Academia , 2001, Lecture Notes in Computer Science.

[33]  Fred G. Gustavson,et al.  Recursive Formulation of Cholesky Algorithm in Fortran 90 , 1998, PARA.

[34]  Isak Jonsson,et al.  Recursive blocked algorithms for solving triangular systems—Part I: one-sided and coupled Sylvester-type matrix equations , 2002, TOMS.

[35]  Jack Dongarra,et al.  LAPACK Working Note 24: LAPACK Block Factorization Algorithms on the INtel iPSC/860 , 1990 .

[36]  Erik Elmroth,et al.  SIAM REVIEW c ○ 2004 Society for Industrial and Applied Mathematics Vol. 46, No. 1, pp. 3–45 Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software ∗ , 2022 .