A fast Fourier transform compiler

The FFTW library for computing the discrete Fourier transform (DFT) has gained wide acceptance in both academia and industry because it provides excellent performance on a variety of machines (competitive with, or even faster than, equivalent libraries supplied by vendors). In FFTW, most of the performance-critical code was generated automatically by a special-purpose compiler called genfft, which outputs C code. Written in Objective Caml, genfft can produce DFT programs for any input length, and it can specialize a DFT program for the common case in which the input data are real rather than complex. Unexpectedly, genfft "discovered" algorithms that were previously unknown, and it reduced the arithmetic complexity of some existing algorithms. This paper describes the internals of this special-purpose compiler in some detail and argues that a specialized compiler is a valuable tool.
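To make the idea of "a compiler that outputs C code for a DFT of a given length" concrete, here is a minimal sketch in Python. It is not genfft's actual pipeline. It assumes the naive O(n²) DFT definition as the input algorithm and applies only a local simplification (dropping multiplications by 0 and ±1). The real-input specialization shown here uses two standard facts: the imaginary inputs are known to be zero, and the output of a real-input DFT is Hermitian-symmetric, so only outputs k ≤ n/2 are needed. The function names `dft_terms`, `emit_c`, `evaluate`, and `count_mults` are hypothetical.

```python
import math

EPS = 1e-12  # tolerance for recognizing twiddle factors that are exactly 0 or +/-1


def snap(c):
    """Round coefficients that are numerically 0 or +/-1 so they simplify away."""
    if abs(c) < EPS:
        return 0.0
    if abs(abs(c) - 1.0) < EPS:
        return math.copysign(1.0, c)
    return c


def dft_terms(n, real_input=False):
    """Return {output: [(coeff, (array, index)), ...]} for an n-point DFT.

    Uses X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n). With x[j] = a + i*b and the
    twiddle factor c - i*s, each product adds (a*c + b*s) to Re X[k] and
    (b*c - a*s) to Im X[k]. Terms whose coefficient is zero are dropped.
    """
    outputs = {}
    # Real input gives Hermitian-symmetric output, so k <= n/2 suffices.
    ks = range(n // 2 + 1) if real_input else range(n)
    for k in ks:
        re, im = [], []
        for j in range(n):
            theta = 2.0 * math.pi * j * k / n
            c, s = snap(math.cos(theta)), snap(math.sin(theta))
            a, b = ("xr", j), ("xi", j)
            re.append((c, a))
            im.append((-s, a))
            if not real_input:  # for real input the xi terms are known zeros
                re.append((s, b))
                im.append((c, b))
        outputs[f"yr[{k}]"] = [(w, v) for w, v in re if w != 0.0]
        outputs[f"yi[{k}]"] = [(w, v) for w, v in im if w != 0.0]
    return outputs


def emit_c(outputs):
    """Print straight-line C assignments; coefficients of +/-1 become +/- only."""
    lines = []
    for lhs, terms in outputs.items():
        if not terms:
            lines.append(f"{lhs} = 0.0;")
            continue
        parts = []
        for i, (w, (arr, j)) in enumerate(terms):
            var = f"{arr}[{j}]"
            body = var if abs(w) == 1.0 else f"{abs(w):.17g} * {var}"
            sign = "-" if w < 0 else "+"
            if i == 0:
                parts.append(body if sign == "+" else f"-{body}")
            else:
                parts.append(f"{sign} {body}")
        lines.append(f"{lhs} = {' '.join(parts)};")
    return "\n".join(lines)


def count_mults(outputs):
    """Count the real multiplications left after simplification."""
    return sum(1 for terms in outputs.values() for w, _ in terms if abs(w) != 1.0)


def evaluate(outputs, xr, xi):
    """Numerically evaluate the generated expressions (useful for testing)."""
    arrays = {"xr": xr, "xi": xi}
    return {lhs: sum(w * arrays[arr][j] for w, (arr, j) in terms)
            for lhs, terms in outputs.items()}
```

For n = 2, `emit_c(dft_terms(2))` yields the familiar butterfly (`yr[0] = xr[0] + xr[1];`, `yr[1] = xr[0] - xr[1];`, and likewise for `yi`). For n = 4 every twiddle factor is 1, -1, i, or -i, so `count_mults(dft_terms(4))` is 0. A production generator would go much further (building the expression DAG from fast algorithms rather than the O(n²) definition, sharing common subexpressions, and scheduling the output), which this sketch deliberately omits.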
