Matrix Multiplication on Multidimensional Torus Networks

Blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. We introduce a generalized 'Split-Dimensional' version of Cannon's algorithm (SD-Cannon) with a higher-dimensional and bidirectional communication structure. This algorithm is useful for torus interconnects that can achieve more injection bandwidth than single-link bandwidth. On a bidirectional torus network of dimension d, SD-Cannon can lower the algorithmic bandwidth cost by a factor of up to d. With rectangular collectives, SUMMA can also achieve this lower bandwidth cost, but at a higher latency cost. We use Charm++ virtualization to map SD-Cannon efficiently onto unbalanced and odd-dimensional torus network partitions. Our performance study on Blue Gene/P demonstrates that an MPI version of SD-Cannon can exploit multiple communication links and improve performance.
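
For reference, the sketch below is a minimal serial simulation of the classical 2D Cannon's algorithm that SD-Cannon generalizes: after an initial skew, blocks of A shift along grid rows and blocks of B shift along grid columns of a p x p virtual process grid, with a local block multiply at each step. It illustrates only the baseline 2D communication structure, written in NumPy rather than MPI or Charm++, and does not reproduce SD-Cannon's split-dimensional or bidirectional shifts; the grid size p and the helper name cannon_matmul are illustrative assumptions, not the authors' implementation.

import numpy as np

def cannon_matmul(A, B, p):
    # Serial simulation of classical 2D Cannon's algorithm on a p x p
    # virtual process grid. A and B are n x n with n divisible by p.
    n = A.shape[0]
    b = n // p
    # Partition into sub-blocks: Ab[i][j] is the (i, j) block of A.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(p)] for i in range(p)]
    Cb = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]
    # Initial skew: shift A-block row i left by i, B-block column j up by j.
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    for _ in range(p):
        # Local multiply on every virtual process.
        for i in range(p):
            for j in range(p):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # Shift A left by one and B up by one: the 2D communication structure.
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(Cb)

if __name__ == "__main__":
    n, p = 8, 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(cannon_matmul(A, B, p), A @ B)

In a distributed-memory setting, each (i, j) pair above would be one process, and the block shifts become nearest-neighbor sends on the torus; SD-Cannon's contribution is to split these shifts across the additional torus dimensions and both link directions to use more of the available injection bandwidth.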
