Strassen's Algorithm Reloaded

We dispel “street wisdom” regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: Strassen's algorithm is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: the algorithm inherently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel® Xeon Phi™ coprocessor utilizing 240 threads. We also show how distributed-memory matrix-matrix multiplication benefits from these advances.
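
For readers unfamiliar with the algorithm itself, the following is a minimal sketch of the textbook one-level Strassen scheme, which replaces the eight block products of a 2 × 2 blocked multiplication with seven. It is not the implementation described above: the helper names (mat_add, mat_mul, split, strassen_one_level) are hypothetical, the matrices are plain Python lists with even dimensions, and every block sum and product is stored in a temporary, which is exactly the workspace the implementation above is said to avoid.

```python
def mat_add(X, Y, sign=1):
    """Return X + sign * Y for equally sized matrices (lists of rows)."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_mul(X, Y):
    """Conventional product; stands in for a tuned DGEMM (assumption)."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def split(X):
    """Split X (even dimensions assumed) into quadrants X00, X01, X10, X11."""
    r, c = len(X) // 2, len(X[0]) // 2
    return ([row[:c] for row in X[:r]], [row[c:] for row in X[:r]],
            [row[:c] for row in X[r:]], [row[c:] for row in X[r:]])


def strassen_one_level(A, B):
    """One level of Strassen: seven block products instead of eight."""
    A00, A01, A10, A11 = split(A)
    B00, B01, B10, B11 = split(B)

    # The seven block products of the classic formulation.
    M0 = mat_mul(mat_add(A00, A11), mat_add(B00, B11))
    M1 = mat_mul(mat_add(A10, A11), B00)
    M2 = mat_mul(A00, mat_add(B01, B11, -1))
    M3 = mat_mul(A11, mat_add(B10, B00, -1))
    M4 = mat_mul(mat_add(A00, A01), B11)
    M5 = mat_mul(mat_add(A10, A00, -1), mat_add(B00, B01))
    M6 = mat_mul(mat_add(A01, A11, -1), mat_add(B10, B11))

    # Combine the products into the four quadrants of C.
    C00 = mat_add(mat_add(M0, M3), mat_add(M6, M4, -1))  # M0 + M3 - M4 + M6
    C01 = mat_add(M2, M4)                                # M2 + M4
    C10 = mat_add(M1, M3)                                # M1 + M3
    C11 = mat_add(mat_add(M0, M1, -1), mat_add(M2, M5))  # M0 - M1 + M2 + M5

    top = [left + right for left, right in zip(C00, C01)]
    bottom = [left + right for left, right in zip(C10, C11)]
    return top + bottom


# A rank-2 update shape: A is 4 x 2 and B is 2 x 4.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 2, 1], [0, 1, 1, 3]]
assert strassen_one_level(A, B) == mat_mul(A, B)
```

The usage lines at the end apply the sketch to a small rank-k shape (k = 2) and check it against the conventional product. Integer entries are used so the comparison is exact; with floating-point data the two results would generally agree only up to rounding.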
