Generalized Cannon's algorithm for parallel matrix multiplication

Cannon’s algorithm is a memory-efficient matrix multiplication technique for parallel computers with toroidal mesh interconnections. The original algorithm assumes that the input matrices are block distributed, and it is not obvious how it can be applied to block-cyclic distributed matrices. This paper generalizes Cannon’s algorithm to the case where the input matrices are block-cyclic distributed across a two-dimensional processor array with an arbitrary number of processors and toroidal mesh interconnections. An efficient scheduling technique is proposed that reduces the number of communication steps to the least common multiple of P and Q for a given P × Q processor array. In addition, a partitioning and communication scheme is proposed to reduce the number of page faults when the matrices are too large to fit into main memory. Performance analysis shows that the proposed generalized Cannon’s algorithm (GCA) incurs fewer page faults than a previously proposed algorithm (SUMMA). Experimental results on the Intel Paragon show that GCA outperforms SUMMA when blocks larger than about 65 × 65 are used. However, GCA’s performance degrades when the block size is relatively small, whereas SUMMA maintains the same performance. It is also shown that GCA maintains higher performance than SUMMA for large matrices.
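The abstract describes GCA only at a high level; for context, the following is a minimal sketch of the classical Cannon's algorithm that GCA generalizes. It simulates a square p × p virtual processor grid sequentially with NumPy, performing the initial skew of A and B followed by p multiply-and-shift steps around the torus. The function name, the sequential simulation, and the square-grid restriction are illustrative assumptions; this sketch does not reproduce the paper's block-cyclic distribution or its LCM(P, Q) communication schedule.

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Simulate Cannon's algorithm on a virtual p x p processor grid.

    A and B are n x n with n divisible by p. Each virtual processor (i, j)
    owns one block of A and one block of B; blocks circulate around a
    toroidal mesh, one shift per step."""
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % p == 0
    b = n // p  # block size owned by each virtual processor

    # Partition A, B into p x p grids of b x b blocks; C starts at zero.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Cb = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]

    # Initial alignment: shift row i of A left by i, column j of B up by j,
    # so processor (i, j) starts with A[i, i+j] and B[i+j, j] (indices mod p).
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]

    # p multiply-and-shift steps: accumulate a block product, then shift
    # A one position left and B one position up around the torus.
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]

    return np.block(Cb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((6, 6))
    B = rng.random((6, 6))
    assert np.allclose(cannon_matmul(A, B, 3), A @ B)
```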

[1] Jaeyoung Choi, et al. A Proposal for a Set of Parallel Basic Linear Algebra Subprograms, 1995, PARA.

[2] Geoffrey C. Fox, et al. Matrix algorithms on a hypercube I: Matrix multiplication, 1987, Parallel Comput.

[3] Robert A. van de Geijn, et al. Parallel implementation of BLAS: general techniques for Level 3 BLAS, 1995, Concurr. Pract. Exp.

[4] S. Huss-Lederman, et al. Comparison of scalable parallel matrix multiplication libraries, 1993, Proceedings of the Scalable Parallel Libraries Conference.

[5] Ramesh C. Agarwal, et al. A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication, 1994, IBM J. Res. Dev.

[6] Lynn Elliot Cannon. A cellular computer to implement the Kalman filter algorithm, 1969.

[7] S. Lennart Johnsson, et al. Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer, 1994, Parallel Comput.

[8] Petter E. Bjørstad, et al. Efficient Matrix Multiplication on SIMD Computers, 1992, SIAM J. Matrix Anal. Appl.

[9] Jaeyoung Choi, et al. PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers, 1994, Concurr. Pract. Exp.

[10] Robert A. van de Geijn, et al. SUMMA: scalable universal matrix multiplication algorithm, 1995, Concurr. Pract. Exp.

[11] Guodong Zhang, et al. Matrix multiplication on the Intel Touchstone Delta, 1994, Concurr. Pract. Exp.

[12] S. Lennart Johnsson, et al. Communication Efficient Basic Linear Algebra Computations on Hypercube Architectures, 1987, J. Parallel Distributed Comput.

[13] Hyuk-Jae Lee, et al. Toward data distribution independent parallel matrix multiplication, 1995, Proceedings of the 9th International Parallel Processing Symposium.