Parallel Matrix Multiplication: A Systematic Journey

We expose a systematic approach for developing distributed-memory parallel matrix-matrix multiplication algorithms. The journey starts with a description of how matrices are distributed to meshes of nodes (e.g., MPI processes), relates these distributions to scalable parallel implementations of matrix-vector multiplication and rank-1 update, continues on to reveal a family of matrix-matrix multiplication algorithms that view the nodes as a two-dimensional (2D) mesh, and finishes by extending these 2D algorithms to so-called three-dimensional (3D) algorithms that view the nodes as a 3D mesh. A cost analysis shows that the 3D algorithms can attain the (order of magnitude) lower bound for the cost of communication. The paper introduces a taxonomy for the resulting family of algorithms and explains how all algorithms have merit depending on parameters such as the sizes of the matrices and architecture parameters. The techniques described in this paper are at the heart of the Elemental distributed-memory linear algebra library.
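To make the 2D family concrete, the sketch below implements one well-known member, a SUMMA-style algorithm, using mpi4py and NumPy. It is a minimal illustration under simplifying assumptions (a square q x q process grid, square matrices with equal local blocks of width n_loc, and an algorithmic block size equal to that local width); the names summa, A_loc, B_loc, and n_loc are invented for this example, and the paper's framework handles general meshes, distributions, and block sizes.

# summa_2d.py: minimal SUMMA-style 2D parallel matrix multiplication sketch
# (illustrative only; assumes a square q x q process grid).
# Run with, e.g.:  mpiexec -n 4 python summa_2d.py
import numpy as np
from mpi4py import MPI

def summa(A_loc, B_loc, cart):
    """Compute the local block of C = A * B on a q x q Cartesian grid."""
    q = cart.Get_topo()[0][0]           # grid dimensions are q x q
    row_comm = cart.Sub([False, True])  # communicator spanning my process row
    col_comm = cart.Sub([True, False])  # communicator spanning my process column
    C_loc = np.zeros((A_loc.shape[0], B_loc.shape[1]))
    for t in range(q):
        # Broadcast the t-th block column of A within each process row ...
        A_panel = A_loc.copy() if row_comm.Get_rank() == t else np.empty_like(A_loc)
        row_comm.Bcast(A_panel, root=t)
        # ... and the t-th block row of B within each process column,
        B_panel = B_loc.copy() if col_comm.Get_rank() == t else np.empty_like(B_loc)
        col_comm.Bcast(B_panel, root=t)
        # then accumulate the local update C_ij += A_it * B_tj.
        C_loc += A_panel @ B_panel
    return C_loc

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    q = int(round(comm.Get_size() ** 0.5))
    assert q * q == comm.Get_size(), "this sketch assumes a square process grid"
    cart = comm.Create_cart(dims=[q, q], periods=[False, False])
    n_loc = 64                               # local block width (n = q * n_loc)
    rng = np.random.default_rng(cart.Get_rank())
    A_loc = rng.standard_normal((n_loc, n_loc))  # local block of A owned by this node
    B_loc = rng.standard_normal((n_loc, n_loc))  # local block of B owned by this node
    C_loc = summa(A_loc, B_loc, cart)

Each of the q steps performs two broadcasts of an (n/q) x (n/q) block followed by a local rank-(n/q) update, so every process communicates on the order of n^2/sqrt(p) matrix entries overall, matching the known lower bound for 2D algorithms that store only O(n^2/p) entries per node; the 3D algorithms discussed in the paper reduce communication further by replicating data across a third mesh dimension.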
