Implementing Strassen's Algorithm with BLIS

We dispel “street wisdom” regarding the practical implementation of Strassen’s algorithm for matrix-matrix multiplication (DGEMM):

- Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices.
- Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK).
- Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond the buffers already incorporated into conventional high-performance DGEMM implementations.
- Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface (sketched after this list).
- Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel® Xeon Phi coprocessor utilizing 240 threads.

We also show how a distributed-memory matrix-matrix multiplication benefits from these advances.
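For concreteness, below is a minimal one-level sketch of classic Strassen in plain C, under simplifying assumptions that are ours, not the paper's: square matrices of even order n, row-major storage, and a naive triple loop as the base case. The explicit temporaries S, T, and M are exactly the workspace that the implementation described above eliminates by folding the matrix additions into the BLIS packing routines and micro-kernel.

```c
#include <stdlib.h>
#include <string.h>

/* C += A*B for h-by-h row-major blocks; each block has its own leading dim. */
static void matmul_acc(int h, const double *A, int lda,
                       const double *B, int ldb, double *C, int ldc)
{
    for (int i = 0; i < h; i++)
        for (int p = 0; p < h; p++)
            for (int j = 0; j < h; j++)
                C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
}

/* Z = X + s*Y; Z is a contiguous h-by-h buffer, X and Y are quadrants. */
static void addsub(int h, const double *X, int ldx,
                   const double *Y, int ldy, double s, double *Z)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            Z[i*h + j] = X[i*ldx + j] + s * Y[i*ldy + j];
}

/* C += s*M; C is a quadrant with leading dimension ldc, M is contiguous. */
static void acc(int h, const double *M, double s, double *C, int ldc)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            C[i*ldc + j] += s * M[i*h + j];
}

/* One level of classic Strassen: C = A*B, n even, row-major, leading dim n.
   Seven block products M1..M7 replace the eight of the blocked classical
   algorithm.  S, T, M are the explicit workspace this sketch needs. */
void strassen_1level(int n, const double *A, const double *B, double *C)
{
    const int h = n / 2;
    const double *A11 = A, *A12 = A + h,
                 *A21 = A + (size_t)h*n, *A22 = A + (size_t)h*n + h;
    const double *B11 = B, *B12 = B + h,
                 *B21 = B + (size_t)h*n, *B22 = B + (size_t)h*n + h;
    double *C11 = C, *C12 = C + h,
           *C21 = C + (size_t)h*n, *C22 = C + (size_t)h*n + h;

    double *S = malloc(sizeof *S * (size_t)h*h);
    double *T = malloc(sizeof *T * (size_t)h*h);
    double *M = malloc(sizeof *M * (size_t)h*h);
    memset(C, 0, sizeof *C * (size_t)n*n);

    /* M1 = (A11+A22)(B11+B22):  C11 += M1, C22 += M1 */
    addsub(h, A11, n, A22, n, +1.0, S);
    addsub(h, B11, n, B22, n, +1.0, T);
    memset(M, 0, sizeof *M * (size_t)h*h);
    matmul_acc(h, S, h, T, h, M, h);
    acc(h, M, +1.0, C11, n);  acc(h, M, +1.0, C22, n);

    /* M2 = (A21+A22)*B11:  C21 += M2, C22 -= M2 */
    addsub(h, A21, n, A22, n, +1.0, S);
    memset(M, 0, sizeof *M * (size_t)h*h);
    matmul_acc(h, S, h, B11, n, M, h);
    acc(h, M, +1.0, C21, n);  acc(h, M, -1.0, C22, n);

    /* M3 = A11*(B12-B22):  C12 += M3, C22 += M3 */
    addsub(h, B12, n, B22, n, -1.0, T);
    memset(M, 0, sizeof *M * (size_t)h*h);
    matmul_acc(h, A11, n, T, h, M, h);
    acc(h, M, +1.0, C12, n);  acc(h, M, +1.0, C22, n);

    /* M4 = A22*(B21-B11):  C11 += M4, C21 += M4 */
    addsub(h, B21, n, B11, n, -1.0, T);
    memset(M, 0, sizeof *M * (size_t)h*h);
    matmul_acc(h, A22, n, T, h, M, h);
    acc(h, M, +1.0, C11, n);  acc(h, M, +1.0, C21, n);

    /* M5 = (A11+A12)*B22:  C11 -= M5, C12 += M5 */
    addsub(h, A11, n, A12, n, +1.0, S);
    memset(M, 0, sizeof *M * (size_t)h*h);
    matmul_acc(h, S, h, B22, n, M, h);
    acc(h, M, -1.0, C11, n);  acc(h, M, +1.0, C12, n);

    /* M6 = (A21-A11)(B11+B12):  C22 += M6 */
    addsub(h, A21, n, A11, n, -1.0, S);
    addsub(h, B11, n, B12, n, +1.0, T);
    memset(M, 0, sizeof *M * (size_t)h*h);
    matmul_acc(h, S, h, T, h, M, h);
    acc(h, M, +1.0, C22, n);

    /* M7 = (A12-A22)(B21+B22):  C11 += M7 */
    addsub(h, A12, n, A22, n, -1.0, S);
    addsub(h, B21, n, B22, n, +1.0, T);
    memset(M, 0, sizeof *M * (size_t)h*h);
    matmul_acc(h, S, h, T, h, M, h);
    acc(h, M, +1.0, C11, n);

    free(S);  free(T);  free(M);
}
```

One level trades one of the eight block multiplications for a constant number of O(n²) block additions; recursing inside strassen_1level and falling back to a tuned DGEMM below a crossover size gives the usual hybrid.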

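On the plug-compatibility point: the standard Fortran-77 BLAS DGEMM interface carries no workspace argument, so a Strassen routine exposed through it must source any extra storage internally. The declaration below shows that interface as called from C; the use of 32-bit int is an assumption (ILP64 builds pass 64-bit integers).

```c
/* Standard BLAS DGEMM prototype as called from C: all arguments are passed
   by reference (Fortran calling convention) and no workspace parameter
   exists.  Computes C := alpha*op(A)*op(B) + beta*C. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc);
```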