Parallel matrix transpose algorithms on distributed memory concurrent computers

This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of nonblocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
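
The block scattered distribution mentioned above can be illustrated with a minimal Python sketch. It assumes the usual block-cyclic convention, in which block (I, J) of the matrix is held by processor (I mod P, J mod Q); the function names `owner`, `transpose_destination`, and `destinations` are hypothetical and are not taken from the paper or from PUMMA. The sketch only shows which processors a given processor would need to send blocks to for a transpose. It does not reproduce the paper's communication schedule.

```python
def owner(I, J, P, Q):
    """Processor (row, col) holding block (I, J) under an assumed
    block-cyclic (block scattered) mapping on a P x Q template."""
    return (I % P, J % Q)


def transpose_destination(I, J, P, Q):
    """Block (I, J) of A becomes block (J, I) of A^T, so it must end up
    on the processor that owns block (J, I)."""
    return owner(J, I, P, Q)


def destinations(p, q, P, Q, Mb, Nb):
    """Set of processors that processor (p, q) sends blocks to when an
    Mb x Nb grid of blocks is transposed (illustrative helper)."""
    dests = set()
    for I in range(p, Mb, P):        # block rows held by processor row p
        for J in range(q, Nb, Q):    # block columns held by processor column q
            dests.add(transpose_destination(I, J, P, Q))
    return dests


if __name__ == "__main__":
    # On a square 2 x 2 template each processor has a single partner.
    print(destinations(0, 1, 2, 2, 4, 4))
    # On a 2 x 3 template one processor may send to many others.
    print(sorted(destinations(0, 0, 2, 3, 6, 6)))
```

Under this assumed mapping, a square template pairs each processor with exactly one partner, whereas with P and Q relatively prime a processor may have to send blocks to every processor in the template. This suggests why overlapping several outgoing messages with nonblocking sends could be useful in the general case.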
