The Design and Implementation of FFTW3

FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.
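As an illustration of what "adapts to the hardware" can mean in practice, the following is a minimal sketch of measurement-based selection: several candidate DFT routines are timed on the current machine and the fastest one is kept for later use. It is not FFTW's planner or code; the names `dft_naive`, `fft_radix2`, and `make_plan` are hypothetical, and the two candidates are textbook algorithms chosen only to keep the example short.

```python
import cmath
import time


def dft_naive(x):
    """Direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def make_plan(n, repeats=3):
    """Hypothetical planner: time each applicable candidate on a size-n
    input and return the fastest one measured on this machine."""
    candidates = [dft_naive]
    if n > 0 and n & (n - 1) == 0:  # radix-2 applies only to powers of two
        candidates.append(fft_radix2)
    sample = [complex(i % 7, -(i % 3)) for i in range(n)]

    def cost(func):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(sample)
            best = min(best, time.perf_counter() - start)
        return best

    return min(candidates, key=cost)


if __name__ == "__main__":
    plan = make_plan(256)           # measure once for this size
    spectrum = plan([1.0] * 256)    # reuse the chosen routine many times
    print(plan.__name__, abs(spectrum[0]))
```

The sketch separates a one-time measurement step from repeated execution, so the measurement cost is paid once per transform size. FFTW's C interface exposes a similar split: a plan is created first (for example with `fftw_plan_dft_1d` and the `FFTW_MEASURE` flag) and then run with `fftw_execute`.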
