论文信息 - Strassen's Algorithm Reloaded

Strassen's Algorithm Reloaded

We dispel with “street wisdom” regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel® Xeon Phi™ coprocessor1 utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances.

[1] V. Strassen. Gaussian elimination is not optimal , 1969 .

[2] A. Smirnov,et al. The bilinear complexity and practical algorithms for matrix multiplication , 2013 .

[3] Robert A. van de Geijn,et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015, ACM Trans. Math. Softw..

[4] Devin Matthews,et al. High-Performance Tensor Contraction without BLAS , 2016, ArXiv.

[5] Wei Huang,et al. Design of High Performance MVAPICH2: MPI2 over InfiniBand , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

[6] Jack Dongarra,et al. LAPACK Users' guide (third ed.) , 1999 .

[7] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..

[8] Robert A. van de Geijn,et al. High-performance implementation of the level-3 BLAS , 2008, TOMS.

[9] James Demmel,et al. Fast matrix multiplication is stable , 2006, Numerische Mathematik.

[10] Arnold Schönhage,et al. Partial and Total Matrix Multiplication , 1981, SIAM J. Comput..

[11] J. R. Johnson,et al. Implementation of Strassen's Algorithm for Matrix Multiplication , 1996, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing.

[12] Endong Wang,et al. Intel Math Kernel Library , 2014 .

[13] Paolo Bientinesi,et al. Design of a High-Performance GEMM-like Tensor–Tensor Multiplication , 2016, ACM Trans. Math. Softw..

[14] Alexandru Nicolau,et al. Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation , 2011, TOMS.

[15] Robert A. van de Geijn,et al. Anatomy of High-Performance Many-Threaded Matrix Multiplication , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[16] Jean-Guillaume Dumas,et al. Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm , 2007, ISSAC '09.

[17] Jack J. Dongarra,et al. A set of level 3 basic linear algebra subprograms , 1990, TOMS.

[18] F. V. Zee,et al. 0 Implementing high-performance complex matrix multiplication via the 1 m method , 2017 .

[19] Jack Dongarra,et al. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers , 1992, [Proceedings 1992] The Fourth Symposium on the Frontiers of Massively Parallel Computation.

[20] Shmuel Winograd,et al. On multiplication of 2 × 2 matrices , 1971 .

[21] James Demmel,et al. Communication-Avoiding Parallel Strassen: Implementation and performance , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[22] Nicholas J. Higham,et al. INVERSE PROBLEMS NEWSLETTER , 1991 .

[23] Robert A. van de Geijn,et al. Anatomy of high-performance matrix multiplication , 2008, TOMS.

[24] Oded Schwartz,et al. Improving the Numerical Stability of Fast Matrix Multiplication , 2015, SIAM J. Matrix Anal. Appl..

[25] Grazia Lotti,et al. O(n2.7799) Complexity for n*n Approximate Matrix Multiplication , 1979, Inf. Process. Lett..

[26] Robert A. van de Geijn,et al. A High Performance Parallel Strassen Implementation , 1995, Parallel Process. Lett..

[27] Y. Danieli. Guide , 2005 .

[28] Michael A. Heroux,et al. GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm , 1994, Journal of Computational Physics.

[29] Jianyu Huang,et al. Performance optimization for the k-nearest neighbors kernel on x86 architectures , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.