Paper information - High-performance implementation of the level-3 BLAS

High-performance implementation of the level-3 BLAS

A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. Exceptional performance is demonstrated on various architectures.

Robert A. van de Geijn | Kazushige Goto

[1] Jack J. Dongarra, et al. A set of level 3 basic linear algebra subprograms, 1990, TOMS.

[2] Ed Anderson, et al. LAPACK Users' Guide, 1995.

[3] Jack J. Dongarra, et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.

[4] Bo Kågström, et al. GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark, 1998, TOMS.

[5] Jack Dongarra, et al. LAPACK Users' Guide (third ed.), 1999.

[6] Robert A. van de Geijn, et al. FLAME: Formal Linear Algebra Methods Environment, 2001, TOMS.

[7] Erik Elmroth, et al. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software, 2004, SIAM Review, Vol. 46, No. 1, pp. 3–45.

[8] Robert A. van de Geijn, et al. Toward Scalable Matrix Multiply on Multithreaded Architectures, 2007, Euro-Par.

[9] Robert A. van de Geijn, et al. Anatomy of high-performance matrix multiplication, 2008, TOMS.