Real and Complex Fast Fourier Transforms on the Fujitsu VPP 500

Fast Fourier transforms parallelize well but require a large amount of communication. The transpose split algorithm concentrates all of this communication in one or two transposition steps. Different transposition algorithms can be used depending on data size and communication latency. A new transpose split algorithm for real and Hermitian data is presented for one-, two-, and three-dimensional transforms. This algorithm is implemented on the Fujitsu VPP 500, a parallel computer with a moderate number of very fast vector processors connected by a crossbar switch. Each processor has a peak performance of 1.6 Gflop/s and can simultaneously read and write 400 MByte/s. Stride-one implementations of multiple FFTs with very long vector lengths on a single node, as described by the author in 1994, are combined with optimized transpositions. One third of peak performance was achieved on configurations with up to 32 processors.
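The transpose split idea for a two-dimensional complex transform can be illustrated with a minimal sketch. It is a plain-Python simulation, not the VPP 500 implementation and not the new real/Hermitian algorithm: the "processors" are simply list slices, and the names `fft`, `global_transpose`, and `transpose_split_fft2` are hypothetical helpers introduced here. It assumes power-of-two dimensions that are divisible by the processor count.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def global_transpose(blocks):
    """Simulated all-to-all exchange.

    blocks[q] holds the rows owned by simulated processor q. The result
    holds the rows of the transposed matrix in the same block distribution.
    On a real machine this is the only step that moves data between nodes.
    """
    p = len(blocks)
    rows = [row for block in blocks for row in block]  # gather (simulation only)
    cols = [list(col) for col in zip(*rows)]
    per = len(cols) // p
    return [cols[q * per:(q + 1) * per] for q in range(p)]

def transpose_split_fft2(a, p):
    """2-D FFT of matrix a (list of rows) over p simulated processors.

    Returns the TRANSPOSE of the 2-D DFT: result[k2][k1] corresponds to
    frequency index k1 along the rows of a and k2 along its columns.
    """
    per = len(a) // p
    blocks = [a[q * per:(q + 1) * per] for q in range(p)]
    # Step 1: each processor transforms its own rows; no communication.
    blocks = [[fft(row) for row in block] for block in blocks]
    # Step 2: one transposition concentrates all the communication.
    blocks = global_transpose(blocks)
    # Step 3: each processor transforms its new rows (the former columns).
    blocks = [[fft(row) for row in block] for block in blocks]
    return [row for block in blocks for row in block]

# Usage: a 4 x 4 matrix distributed over 2 simulated processors.
a = [[complex(4 * r + c, 0.0) for c in range(4)] for r in range(4)]
result_t = transpose_split_fft2(a, p=2)
```

The sketch leaves the result in transposed order; a second transposition would restore the original layout, which matches the "one or two transposition steps" mentioned above.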

[1] Christophe Calvin, et al. Minimizing Communication Overhead Using Pipelining for Multi-Dimensional FFT on Distributed Memory Machines, 1993, PARCO.

[2] Dan E. Dudgeon, et al. Multidimensional Digital Signal Processing, 1983.

[3] C. Van Loan. Computational Frameworks for the Fast Fourier Transform, 1992.

[4] Clive Temperton, et al. Fast methods on parallel and vector machines, 1982.

[5] M. Hegland. A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing, 1994.

[6] J. O. Eklundh. Efficient Matrix Transposition, 1981.

[7] G. Rodrigue. Parallel Computations, 1982.

[8] Richard C. Singleton, et al. On computing the fast Fourier transform, 1967, Commun. ACM.

[9] Paul N. Swarztrauber. Symmetric FFTs, 1986.

[10] J. J. Lambiotte, et al. Computing the Fast Fourier Transform on a vector computer, 1979.

[11] Markus Hegland. Block Algorithms for FFTs on Vector and Parallel Computers, 1993, PARCO.

[12] J. Tukey, et al. An algorithm for the machine calculation of complex Fourier series, 1965.

[13] S. Lennart Johnsson, et al. Cooley-Tukey FFT on the Connection Machine, 1992, Parallel Comput.

[14] Marshall C. Pease, et al. An Adaptation of the Fast Fourier Transform for Parallel Processing, 1968, JACM.

[15] Paul N. Swarztrauber, et al. Vectorizing the FFTs, 1982.

[16] Paul N. Swarztrauber, et al. FFT algorithms for vector computers, 1984, Parallel Comput.

[17] Vipin Kumar, et al. The Scalability of FFT on Parallel Computers, 1993, IEEE Trans. Parallel Distributed Syst.