Generalizing Matrix Multiplication for Efficient Computations on Modern Computers

Recent advances in computing allow taking a new look at matrix multiplication. The key ideas are decreasing interest in recursion, the development of processors with thousands (potentially millions) of processing units, and influences from the Algebraic Path Problem. In this context, we propose a generalized matrix-matrix multiply-add (MMA) operation and illustrate its usability. Furthermore, we elaborate on the interrelation between this generalization and the BLAS standard.
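The abstract does not spell out the generalized operation, so the following is only a minimal sketch of one plausible reading: an MMA update C ← C ⊕ (A ⊗ B) in which the scalar "add" and "multiply" are parameters, as the reference to the Algebraic Path Problem suggests. The function name `generalized_mma` and its `add`/`mul` arguments are hypothetical and are not taken from the paper.

```python
import math
import operator


def generalized_mma(A, B, C, add, mul):
    """Return C (+) A (x) B, where (+) is `add` and (x) is `mul`.

    Assumption: A is n-by-k, B is k-by-m, C is n-by-m, all given as
    lists of lists. The inputs are not modified.
    """
    n, k, m = len(A), len(B), len(B[0])
    result = [row[:] for row in C]  # copy C so the caller's matrix is untouched
    for i in range(n):
        for j in range(m):
            acc = result[i][j]
            for p in range(k):
                # One scalar multiply-add step in the chosen algebra.
                acc = add(acc, mul(A[i][p], B[p][j]))
            result[i][j] = acc
    return result


# Ordinary arithmetic (+, *): the familiar C + A*B update.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
plain = generalized_mma(A, B, C, operator.add, operator.mul)

# Min-plus algebra (min, +): one relaxation step for shortest paths.
INF = math.inf
D = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
two_hop = generalized_mma(D, D, D, min, operator.add)

print(plain)    # [[19, 22], [43, 50]]
print(two_hop)  # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```

With `(+, *)` the sketch reduces to the multiply-add update that BLAS GEMM performs when both scaling factors are 1. With `(min, +)` the same loop nest computes shortest paths using at most two edges, which is the kind of reuse the generalization appears to aim for.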
