DFT Performance Prediction in FFTW

Fastest Fourier Transform in the West (FFTW) is an adaptive FFT library that generates highly efficient Discrete Fourier Transform (DFT) implementations. It is one of the fastest FFT libraries available and outperforms many adaptive or hand-tuned DFT libraries. Its success relies largely on the huge search space spanned by several FFT algorithms and a set of compiler-generated C code fragments (called codelets) for small DFT sizes. FFTW finds the best algorithm empirically by measuring the performance of different algorithm combinations. Although empirical search works very well for FFTW, the search process does not explain why the best plan found performs best, and the search overhead grows polynomially as the DFT size increases. The alternative to empirical search is model-driven optimization. However, it is widely believed that model-driven optimization is inferior to empirical search and is particularly ill-suited to problems as complex as DFT optimization. In this paper, we propose a model-driven DFT performance predictor that can replace the empirical search engine in FFTW. Our technique adapts to different architectures and automatically predicts the performance of DFT algorithms and codelets (including SIMD codelets). Our experiments show that this technique yields DFT implementations that achieve more than 95% of the performance of the original FFTW while using less than 5% of its search overhead on four test platforms. More importantly, our models give insight into why different combinations of DFT algorithms perform differently on a processor given its architectural features.
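To make the contrast between empirical search and model-driven selection concrete, the following is a minimal sketch, not the predictor proposed in the paper. It enumerates Cooley-Tukey style plans for a DFT of size n and picks the one with the lowest cost under a pluggable cost function. The codelet sizes, the per-codelet costs, the twiddle term, and the names `plans`, `predicted_cost`, and `best_plan` are all hypothetical and chosen only for illustration.

```python
# Hypothetical codelet sizes and per-call costs (illustrative numbers only,
# not measurements or model outputs from FFTW or the paper).
CODELET_COST = {2: 4.0, 4: 16.0, 8: 52.0, 16: 144.0}


def plans(n):
    """Yield plans for a size-n DFT: a codelet leaf, or a split n = r * m."""
    if n in CODELET_COST:
        yield n  # leaf: the whole transform is handled by one codelet
    for r in CODELET_COST:
        if r < n and n % r == 0:
            for sub in plans(n // r):
                yield (r, sub)  # radix-r step followed by a size n // r sub-plan


def plan_size(plan):
    """Return the DFT size that a plan computes."""
    if isinstance(plan, int):
        return plan
    r, sub = plan
    return r * plan_size(sub)


def predicted_cost(plan):
    """Toy analytical model (an assumption): codelet calls plus twiddle multiplies."""
    if isinstance(plan, int):
        return CODELET_COST[plan]
    r, sub = plan
    m = plan_size(sub)
    # m radix-r codelet calls, r sub-transforms of size m, and r * m twiddles
    return m * CODELET_COST[r] + r * predicted_cost(sub) + r * m * 1.0


def best_plan(n, cost=predicted_cost):
    """Pick the cheapest plan under the given cost function."""
    return min(plans(n), key=cost)


if __name__ == "__main__":
    plan = best_plan(64)
    print(plan, predicted_cost(plan))
```

Passing a `cost` function that executes and times each candidate plan would correspond to empirical search, while a purely analytical `cost` such as the toy `predicted_cost` above corresponds to the model-driven approach, which avoids running the candidates at all.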
