FFTW: an adaptive software architecture for the FFT

FFT literature has been mostly concerned with minimizing the number of floating-point operations performed by an algorithm. Unfortunately, on present-day microprocessors this measure is far less important than it used to be, and interactions with the processor pipeline and the memory hierarchy have a larger impact on performance. Consequently, one must know the details of a computer architecture in order to design a fast algorithm. In this paper, we propose an adaptive FFT program that tunes the computation automatically for any particular hardware. We compared our program, called FFTW, with over 40 implementations of the FFT on 7 machines. Our tests show that FFTW's self-optimizing approach usually yields significantly better performance than all other publicly available software. FFTW also compares favorably with machine-specific, vendor-optimized libraries.
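
To make the self-optimizing idea concrete, the following is a minimal sketch of timing-based tuning. It is not FFTW's actual planner or API. The names `naive_dft`, `fft`, `plan`, and the `leaf` parameter are hypothetical and chosen only for illustration. The sketch assumes the only tunable choice is the size at which a radix-2 recursion falls back to a direct DFT. It measures each candidate on the current machine and keeps the fastest.

```python
import cmath
import time


def naive_dft(x):
    """Direct O(n^2) DFT, used as the recursion base case and as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft(x, leaf):
    """Radix-2 decimation-in-time FFT that falls back to naive_dft
    when the length is at most `leaf` (hypothetical tuning knob) or odd."""
    n = len(x)
    if n <= leaf or n % 2:
        return naive_dft(x)
    even = fft(x[0::2], leaf)
    odd = fft(x[1::2], leaf)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def plan(n, leaves=(1, 2, 4, 8, 16), repeats=3):
    """Time every candidate leaf size on this machine; return the fastest."""
    sample = [complex(i % 7, -(i % 3)) for i in range(n)]
    best_leaf, best_time = None, float("inf")
    for leaf in leaves:
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fft(sample, leaf)
            # Keep the best of several runs to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_leaf, best_time = leaf, elapsed
    return best_leaf


if __name__ == "__main__":
    n = 256
    leaf = plan(n)  # the result depends on the machine it runs on
    data = [complex(i, 0) for i in range(n)]
    spectrum = fft(data, leaf)
    print("chosen leaf size:", leaf)
```

The selection depends on measured run time rather than on an operation count, so the same code may settle on different leaf sizes on different hardware, which is the behavior the abstract describes in general terms.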
