Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection

To exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries such as LAPACK commonly implement operations as blocked algorithms. However, to achieve near-optimal performance with such algorithms, significant tuning is required. On the other hand, recursive algorithms are virtually tuning-free, yet attain similar performance. In this paper, we first analyze and compare blocked and recursive algorithms in terms of performance, and then introduce ReLAPACK, an open-source library of recursive algorithms that seamlessly replaces most of LAPACK's blocked algorithms. In many scenarios, ReLAPACK clearly outperforms reference LAPACK, and even improves upon the performance of optimized libraries.
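To illustrate why recursive algorithms need essentially no tuning, here is a minimal recursive Cholesky factorization sketch in Python/NumPy. This is an assumption-laden toy, not ReLAPACK's implementation: the function name is invented, and a general `np.linalg.solve` stands in for the triangular solve (TRSM) a real library would call. The point is that the matrix is simply halved at every level, so no block-size parameter ever appears.

```python
import numpy as np

def recursive_cholesky(A):
    """Toy recursive Cholesky sketch (NOT ReLAPACK's code).

    Returns a lower-triangular L with A = L @ L.T, assuming A is
    symmetric positive definite. The matrix is split in half at each
    level, so there is no block-size parameter to tune.
    """
    n = A.shape[0]
    if n == 1:                                  # base case: scalar square root
        return np.sqrt(A)
    k = n // 2                                  # untuned split: just halve
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]
    L11 = recursive_cholesky(A11)               # recurse on top-left block
    # L21 = A21 * L11^{-T}; a real library would use a triangular solve (TRSM)
    L21 = np.linalg.solve(L11, A21.T).T
    # recurse on the Schur complement of A11
    L22 = recursive_cholesky(A22 - L21 @ L21.T)
    L = np.zeros_like(A)
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22
    return L
```

The large matrix multiply in the Schur-complement update (`L21 @ L21.T`) is where a tuned level-3 BLAS kernel does most of the work, which is how recursion inherits BLAS performance without its own blocking factor.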
