Communication-efficient implementation of block recursive algorithms on distributed-memory machines

This paper presents a design methodology for developing efficient distributed-memory parallel programs for block-recursive algorithms such as the fast Fourier transform (FFT) and bitonic sort. The methodology is specifically suited to most modern supercomputers that have a distributed-memory architecture with a circuit-switched or wormhole-routed mesh or hypercube interconnection network. Algorithms are represented in a mathematical framework based on the tensor product and other matrix operations. Communication-efficient implementations that effectively overlap computation and communication are obtained by manipulating this mathematical representation using tensor algebra. Performance results for FFT programs on the Intel iPSC/860 and Intel Paragon are presented.
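As an illustration of what a tensor-product representation of a block-recursive algorithm can look like, the sketch below checks the standard Cooley-Tukey factorization F_rs = (F_r ⊗ I_s) T (I_r ⊗ F_s) L numerically for small sizes. This is the textbook form of the identity and is not necessarily the exact formulation used in the paper. The function names and the dense list-of-rows matrix representation are assumptions made for this example only.

```python
import cmath


def dft_matrix(n):
    """Dense n x n DFT matrix F_n with entries exp(-2*pi*i*j*k/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]


def identity(n):
    """Dense n x n identity matrix I_n."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Tensor (Kronecker) product of two dense matrices (lists of rows)."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matmul(a, b):
    """Plain dense matrix product a @ b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def stride_permutation(n, r):
    """Stride permutation L: output a*s + b reads input b*r + a, s = n // r."""
    s = n // r
    p = [[0.0] * n for _ in range(n)]
    for a in range(r):
        for b in range(s):
            p[a * s + b][b * r + a] = 1.0
    return p


def twiddle(r, s):
    """Diagonal twiddle matrix T with w_n^(a*c) at position a*s + c, n = r*s."""
    n = r * s
    w = cmath.exp(-2j * cmath.pi / n)
    t = [[0.0] * n for _ in range(n)]
    for a in range(r):
        for c in range(s):
            t[a * s + c][a * s + c] = w ** (a * c)
    return t


def factorization_error(r, s):
    """Largest entrywise gap between F_rs and the product of its factors."""
    n = r * s
    factors = [
        kron(dft_matrix(r), identity(s)),   # F_r (x) I_s
        twiddle(r, s),                      # T
        kron(identity(r), dft_matrix(s)),   # I_r (x) F_s
        stride_permutation(n, r),           # L
    ]
    product = identity(n)
    for f in factors:
        product = matmul(product, f)
    target = dft_matrix(n)
    return max(abs(x - y)
               for row_p, row_t in zip(product, target)
               for x, y in zip(row_p, row_t))


if __name__ == "__main__":
    for r, s in [(2, 4), (4, 2), (3, 5)]:
        print(r, s, factorization_error(r, s) < 1e-9)
```

In this form the factor I_r ⊗ F_s applies r independent size-s DFTs to contiguous blocks, while the stride permutation L only moves data. On a distributed-memory machine one would typically expect the former to map to local computation and the latter to interprocessor communication, which is why algebraic rewriting of such formulas can change the communication pattern of the resulting program.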
