Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms

Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon's algorithm and to run faster in practice. "3D" algorithms arrange the $p$ processors in a 3D array and store redundant copies of the matrices on each of $p^{1/3}$ layers. "2D" algorithms, such as Cannon's algorithm, store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we can take advantage of any amount of extra memory to store $c$ copies of the data, for any $c \in \{1, 2, \ldots, \lfloor p^{1/3} \rfloor\}$, to reduce the bandwidth cost of Cannon's algorithm by a factor of $c^{1/2}$ and the latency cost by a factor of $c^{3/2}$. We also show that these costs reach the lower bounds, modulo polylog($p$) factors. We introduce a novel algorithm for 2.5D LU decomposition. To the best of our knowledge, this LU algorithm is the first to minimize communication along the critical path of execution in the 3D case. Our 2.5D LU algorithm uses communication-avoiding pivoting, a stable alternative to partial pivoting. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while $c$ copies of the data can also reduce the bandwidth by a factor of $c^{1/2}$, the latency must increase by a factor of $c^{1/2}$, so that the 2D LU algorithm ($c = 1$) in fact minimizes latency. We provide implementations and performance results for 2D and 2.5D versions of all the new algorithms. Our results demonstrate that 2.5D matrix multiplication and LU algorithms strongly scale more efficiently than 2D algorithms. Each of our 2.5D algorithms runs more than 2x faster than the corresponding 2D algorithm for certain problem sizes on 65,536 cores of a BG/P supercomputer.
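The scaling claims above can be illustrated with a small cost model. The following Python sketch is not the paper's implementation. It assumes the standard leading-order costs of Cannon's algorithm for an n-by-n multiplication on p processors (about n²/√p words moved and √p messages per processor, constants dropped) and applies the reduction factors stated in the abstract. The function names `max_replication`, `matmul_costs`, and `lu_relative_factors` are hypothetical, chosen only for this illustration.

```python
import math


def max_replication(p: int) -> int:
    """Largest integer c with c**3 <= p, i.e. floor of the cube root of p.

    Computed with integer corrections because a floating-point cube root
    can land just below an exact cube (e.g. 64 ** (1/3) is 3.999...).
    """
    c = round(p ** (1.0 / 3.0))
    while c ** 3 > p:
        c -= 1
    while (c + 1) ** 3 <= p:
        c += 1
    return c


def matmul_costs(n: int, p: int, c: int = 1) -> dict:
    """Leading-order per-processor costs of 2.5D matrix multiplication.

    Assumption: Cannon's algorithm (c = 1) moves ~n^2/sqrt(p) words in
    ~sqrt(p) messages; constants and polylog(p) factors are dropped.
    With c copies, bandwidth shrinks by c^(1/2) and latency by c^(3/2).
    """
    if not 1 <= c <= max_replication(p):
        raise ValueError("c must satisfy 1 <= c <= floor(p^(1/3))")
    return {
        "bandwidth_words": n * n / math.sqrt(c * p),
        "latency_messages": math.sqrt(p / c ** 3),
    }


def lu_relative_factors(c: int) -> dict:
    """Change in 2.5D LU costs relative to 2D LU (c = 1), per the abstract.

    Bandwidth is multiplied by c^(-1/2); latency is multiplied by c^(1/2).
    """
    return {
        "bandwidth_factor": 1.0 / math.sqrt(c),
        "latency_factor": math.sqrt(c),
    }


if __name__ == "__main__":
    n, p = 1024, 64
    for c in range(1, max_replication(p) + 1):
        print(c, matmul_costs(n, p, c), lu_relative_factors(c))
```

In this model, the extremes c = 1 and c = ⌊p^(1/3)⌋ correspond to the 2D and 3D cases, and the LU helper makes the trade-off visible: each additional copy lowers bandwidth cost but raises latency cost by the same factor.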
