Matrix Multiplication on Boolean Cubes using Generic Communication Primitives

Generic primitives for matrix operations, as defined by levels one, two, and three of the BLAS, are of great value in that they make user programs much simpler and hide most of the architectural detail of importance for performance in the primitives. We describe generic shared memory primitives such as one-to-all and all-to-all broadcasting, and one-to-all and all-to-all personalized communication, and implementations thereof that are within a factor of two of the best known lower bounds. We describe algorithms for the multiplication of arbitrarily shaped matrices using these primitives. Of the three loops required for a standard matrix multiplication algorithm expressed in Fortran, all three can be parallelized. We show that if one loop is parallelized, then the processors should be aligned with the loop having the most elements. Depending on the initial matrix allocation, data permutations may be required to accomplish the processor/loop alignment. This permutation is included in our analysis. We show that in parallelizing two loops, the optimum aspect ratio of the processing plane is equal to the ratio of the number of matrix elements in the two loops being parallelized.
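
To make the loop structure concrete, the following is a minimal sequential sketch in Python (not the Fortran the abstract refers to) of the three loops of a P x Q by Q x R matrix product, together with two illustrative helpers. The names matmul_three_loops, longest_loop, and processor_plane are hypothetical and do not come from the paper. The helper processor_plane assumes that a Boolean cube of 2**n processors is split into a plane whose sides are powers of two, and that "number of matrix elements in a loop" means the loop length. Both are assumptions of this sketch, and the rounding to the nearest power-of-two ratio is a simplification rather than the analysis described above.

```python
import math


def matmul_three_loops(A, B):
    """Return C = A * B for A of shape P x Q and B of shape Q x R (nested lists)."""
    P, Q, R = len(A), len(B), len(B[0])
    C = [[0] * R for _ in range(P)]
    for i in range(P):          # loop over rows of A and C (length P)
        for j in range(R):      # loop over columns of B and C (length R)
            for k in range(Q):  # loop over the summation index (length Q)
                C[i][j] += A[i][k] * B[k][j]
    return C


def longest_loop(P, Q, R):
    """Name the loop with the most elements: the candidate when one loop is parallelized."""
    return max((("i", P), ("k", Q), ("j", R)), key=lambda item: item[1])[0]


def processor_plane(n, len_a, len_b):
    """Split 2**n processors into a 2**a x 2**(n - a) plane (assumed layout).

    The exponent a is chosen so that the plane's aspect ratio is as close as
    possible, on a log2 scale, to len_a / len_b.
    """
    target = math.log2(len_a / len_b)
    # log2 of the aspect ratio 2**a / 2**(n - a) is 2*a - n.
    a = min(range(n + 1), key=lambda d: abs((2 * d - n) - target))
    return 2 ** a, 2 ** (n - a)
```

For example, with 2**6 processors and parallelized loops of lengths 64 and 16, processor_plane(6, 64, 16) selects a 16 x 4 plane, whose aspect ratio of 4 matches the ratio of the two loop lengths.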