Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as hybrid Strassen's and Winograd's fast matrix multiply, or their combination with the 3M algorithm for complex matrices (hybrid meaning that a recursive algorithm such as Strassen's is applied only down to the size at which a highly tuned BLAS matrix multiplication becomes faster). We investigate how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful, natural exploitation of parallelism at the function level: function-call parallelism, function percolation, and function software pipelining. We make three contributions. First, we present a performance overview for double- and double-complex-precision matrices on state-of-the-art SMP systems. Second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new schedules of Winograd's matrix multiplication, achieving up to 20% speedup over regular matrix multiplication; of the two Winograd schedules, one minimizes the number of matrix additions and the other minimizes the computation latency of those additions. Third, we apply software pipelining and thread allocation to all of these algorithms and show that this yields up to a further 10% performance improvement.
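To make the hybrid recursion concrete, here is a minimal sketch (not the paper's implementation) of the classic Strassen-Winograd schedule in Python/NumPy: seven half-size products and fifteen matrix additions per step, with the recursion cut off at a threshold below which the product is handed to the tuned BLAS GEMM (`np.matmul` here). The function name and the `cutoff` value are illustrative assumptions.

```python
import numpy as np

def winograd_mm(A, B, cutoff=64):
    """One step of the Strassen-Winograd schedule: 7 recursive products
    and 15 matrix additions. Assumes square n-by-n operands; below
    `cutoff` (or for odd n, to keep the sketch short) the product is
    handed to the tuned BLAS GEMM."""
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B                     # leaf: highly tuned BLAS GEMM
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # 8 pre-additions on the operands
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    S5 = B12 - B11; S6 = B22 - S5; S7 = B22 - B12; S8 = S6 - B21

    # 7 half-size products; these calls are mutually independent,
    # which is what function-call parallelism can exploit
    M1 = winograd_mm(S2, S6, cutoff)
    M2 = winograd_mm(A11, B11, cutoff)
    M3 = winograd_mm(A12, B21, cutoff)
    M4 = winograd_mm(S3, S7, cutoff)
    M5 = winograd_mm(S1, S5, cutoff)
    M6 = winograd_mm(S4, B22, cutoff)
    M7 = winograd_mm(A22, S8, cutoff)

    # 7 post-additions assembling the result
    T1 = M1 + M2; T2 = T1 + M4
    C = np.empty_like(A)
    C[:h, :h] = M2 + M3
    C[:h, h:] = T1 + M5 + M6
    C[h:, :h] = T2 - M7
    C[h:, h:] = T2 + M5
    return C
```

In practice the cutoff is machine-dependent and chosen by tuning: it marks the point where the O(n^3) GEMM's cache- and register-level optimizations outweigh the asymptotic savings of the recursion.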
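The 3M algorithm mentioned above trades one real matrix multiplication for extra additions when multiplying complex matrices: instead of the four real products of the schoolbook method, three suffice. A minimal sketch, assuming NumPy and a hypothetical function name `mm_3m`:

```python
import numpy as np

def mm_3m(X, Y):
    """3M complex product: (A + iB)(C + iD) from three real GEMMs.
    Real part = T1 - T2; imaginary part = T3 - T1 - T2."""
    A, B = X.real, X.imag
    C, D = Y.real, Y.imag
    T1 = A @ C                  # real GEMM 1
    T2 = B @ D                  # real GEMM 2
    T3 = (A + B) @ (C + D)      # real GEMM 3
    return (T1 - T2) + 1j * (T3 - T1 - T2)
```

Each of the three real products can itself be computed by a Winograd-style recursion, which is the kind of combination the abstract refers to.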
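Function software pipelining, as used above, overlaps the latency of matrix additions (MA) with matrix multiplications (MM). The following toy two-stage sketch is only an assumption-laden illustration of the idea, not the paper's schedule; it relies on Python threads and a BLAS-backed NumPy whose GEMM releases the GIL, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def pipelined_products(operand_thunks):
    """Two-stage software pipeline: while a worker thread runs the GEMM
    for product k, the main thread computes the operand additions for
    product k+1, hiding matrix-addition latency behind the multiply."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = None
        for thunk in operand_thunks:
            x, y = thunk()               # MA stage for this product ...
            if fut is not None:          # ... overlaps the previous MM
                results.append(fut.result())
            fut = pool.submit(np.matmul, x, y)
        if fut is not None:
            results.append(fut.result())
    return results

# Example: three products whose operands are sums/differences of
# matrices, as in the pre-additions of a Winograd step.
rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((512, 512)) for _ in range(4))
P1, P2, P3 = pipelined_products([
    lambda: (A + B, C),
    lambda: (A, C + D),
    lambda: (A - B, C - D),
])
```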
