SIAM REVIEW, Vol. 46, No. 1, pp. 3–45. © 2004 Society for Industrial and Applied Mathematics

Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software

Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architectural features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today's memory-tiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. Indeed, the whole gamut of existing dense linear algebra factorizations is being reexamined in view of the recursive paradigm. The use of recursion has also led to new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level 3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today's high-performance computing systems.
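To make the recursive paradigm concrete, the following is a minimal sketch (not the paper's library code) of a recursive blocked Cholesky factorization in Python/NumPy. Splitting the matrix roughly in half at each level produces the "automatic variable blocking" the abstract describes: block sizes shrink geometrically, so some level of the recursion fits each level of the memory hierarchy. The leaf size `blk` and the use of NumPy routines as stand-ins for optimized level 3 kernels are illustrative assumptions.

```python
import numpy as np

def rchol(A, blk=32):
    """Recursive blocked Cholesky: A = L @ L.T with L lower triangular.

    Partition A = [[A11, A21.T], [A21, A22]]; factor A11 recursively,
    solve a triangular system for L21, form the Schur complement, and
    recurse on it. The triangular solve and the rank-k update are the
    GEMM-rich level 3 operations that dominate the flop count.
    """
    n = A.shape[0]
    if n <= blk:
        # Leaf: in a real library this would call an optimized
        # superscalar kernel; here we defer to LAPACK via NumPy.
        return np.linalg.cholesky(A)
    k = n // 2                                  # split roughly in half
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]
    L11 = rchol(A11, blk)                       # recurse on leading block
    L21 = np.linalg.solve(L11, A21.T).T         # solve L21 @ L11.T = A21
    S = A22 - L21 @ L21.T                       # Schur complement update
    L22 = rchol(S, blk)                         # recurse on trailing block
    L = np.zeros_like(A)
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22
    return L

# Quick check on a random symmetric positive definite matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 64))
A = M @ M.T + 64 * np.eye(64)
L = rchol(A, blk=8)
assert np.allclose(L @ L.T, A)
```

Because the Cholesky factor of a symmetric positive definite matrix is unique (with positive diagonal), the recursive variant computes the same factor as the conventional fixed-blocked algorithm; only the order of the level 3 operations, and hence the memory traffic, differs.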
