Paper Information - Implementing Strassen's Algorithm with BLIS

Implementing Strassen's Algorithm with BLIS

We dispel “street wisdom” regarding the practical implementation of Strassen’s algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel® Xeon Phi coprocessor utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances.
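For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal sketch of the textbook one-level Strassen scheme, which forms the product from seven sub-block multiplications instead of eight. It is not the paper's implementation: it allocates explicit temporaries for every block sum and product (exactly the workspace the abstract says the authors avoid), uses plain Python lists rather than BLIS, and assumes all matrix dimensions are even. The helper names `matmul`, `add`, `split`, and `strassen_one_level` are hypothetical and chosen only for illustration.

```python
def matmul(A, B):
    """Conventional triple-loop product of two list-of-lists matrices."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def add(X, Y, sign=1):
    """Return X + sign * Y elementwise (sign=-1 gives X - Y)."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Split a matrix with even dimensions into its four quadrants."""
    r, c = len(M) // 2, len(M[0]) // 2
    return ([row[:c] for row in M[:r]], [row[c:] for row in M[:r]],
            [row[:c] for row in M[r:]], [row[c:] for row in M[r:]])


def strassen_one_level(A, B):
    """One level of Strassen: 7 block products instead of 8.

    Assumes every dimension of A and B is even. Each block product
    falls back to the conventional matmul above.
    """
    A00, A01, A10, A11 = split(A)
    B00, B01, B10, B11 = split(B)

    M0 = matmul(add(A00, A11), add(B00, B11))
    M1 = matmul(add(A10, A11), B00)
    M2 = matmul(A00, add(B01, B11, -1))
    M3 = matmul(A11, add(B10, B00, -1))
    M4 = matmul(add(A00, A01), B11)
    M5 = matmul(add(A10, A00, -1), add(B00, B01))
    M6 = matmul(add(A01, A11, -1), add(B10, B11))

    C00 = add(add(M0, M3), add(M6, M4, -1))  # M0 + M3 - M4 + M6
    C01 = add(M2, M4)                        # M2 + M4
    C10 = add(M1, M3)                        # M1 + M3
    C11 = add(add(M0, M1, -1), add(M2, M5))  # M0 - M1 + M2 + M5

    # Stitch the four quadrants back into one matrix.
    top = [left + right for left, right in zip(C00, C01)]
    bottom = [left + right for left, right in zip(C10, C11)]
    return top + bottom


if __name__ == "__main__":
    # A rank-k-update-like shape: A is 4 x 2, B is 2 x 4 (k = 2 is small).
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    B = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(strassen_one_level(A, B) == matmul(A, B))
```

The sketch accepts rectangular operands as long as each dimension is even, so it also covers the small-k shape mentioned above; handling odd dimensions or recursing further would need extra logic that is omitted here.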
