SIAM REVIEW, Vol. 46, No. 1, pp. 3–45. © 2004 Society for Industrial and Applied Mathematics

Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software

Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architectural features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today's memory-tiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. Indeed, the whole gamut of existing dense linear algebra factorizations is being reexamined in view of the recursive paradigm. The use of recursion has also led to new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level 3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today's high-performance computing systems.
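To make the recursive paradigm concrete, the following is a minimal sketch (not the paper's library code) of a recursive blocked Cholesky factorization in Python/NumPy. Splitting the matrix roughly in half at each level produces the "automatic variable blocking" the abstract describes: block sizes shrink geometrically, so some level of the recursion fits each level of the memory hierarchy. The leaf size `blk` and the use of NumPy routines as stand-ins for optimized level 3 kernels are illustrative assumptions.

```python
import numpy as np

def rchol(A, blk=32):
    """Recursive blocked Cholesky: A = L @ L.T with L lower triangular.

    Partition A = [[A11, A21.T], [A21, A22]]; factor A11 recursively,
    solve a triangular system for L21, form the Schur complement, and
    recurse on it. The triangular solve and the rank-k update are the
    GEMM-rich level 3 operations that dominate the flop count.
    """
    n = A.shape[0]
    if n <= blk:
        # Leaf: in a real library this would call an optimized
        # superscalar kernel; here we defer to LAPACK via NumPy.
        return np.linalg.cholesky(A)
    k = n // 2                                  # split roughly in half
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]
    L11 = rchol(A11, blk)                       # recurse on leading block
    L21 = np.linalg.solve(L11, A21.T).T         # solve L21 @ L11.T = A21
    S = A22 - L21 @ L21.T                       # Schur complement update
    L22 = rchol(S, blk)                         # recurse on trailing block
    L = np.zeros_like(A)
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22
    return L

# Quick check on a random symmetric positive definite matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 64))
A = M @ M.T + 64 * np.eye(64)
L = rchol(A, blk=8)
assert np.allclose(L @ L.T, A)
```

Because the Cholesky factor of a symmetric positive definite matrix is unique (with positive diagonal), the recursive variant computes the same factor as the conventional fixed-blocked algorithm; only the order of the level 3 operations, and hence the memory traffic, differs.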
