SIMD Vectorization of Non-Two-Power Sized FFTs

SIMD (single instruction, multiple data) vector instructions, such as Intel's SSE family, are available on most architectures but are difficult to exploit for speed-up. In many cases, including the fast Fourier transform (FFT), signal processing algorithms must undergo major transformations before they map efficiently to these instructions. Using the Kronecker product formalism, we rigorously derive a novel variant of the general-radix Cooley-Tukey FFT that is structured to map efficiently for any vector length v and any radix. We then incorporate the new FFT into the program generator Spiral to generate actual C implementations. Benchmarks on Intel's SSE show that the new algorithms outperform the best available libraries, Intel's MKL and FFTW, on practically all sizes.
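For orientation, the textbook general-radix Cooley-Tukey factorization that such derivations start from can be written in Kronecker product form as DFT_mn = (DFT_m ⊗ I_n) T (I_m ⊗ DFT_n) L, where L is a stride permutation and T a diagonal matrix of twiddle factors. The sketch below is a plain scalar Python rendering of that standard factorization, not the vectorized variant derived in the paper and not Spiral-generated code. The function names (dft_naive, smallest_factor, fft_mixed_radix) and the choice of the smallest prime factor as the radix are assumptions made for illustration only.

```python
import cmath

def dft_naive(x):
    """Direct O(n^2) DFT, used for prime sizes and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def smallest_factor(n):
    """Smallest factor of n greater than 1 (n itself if n is prime or n <= 1)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n

def fft_mixed_radix(x):
    """Scalar general-radix Cooley-Tukey FFT for any length N = m * n.

    Follows DFT_mn = (DFT_m (x) I_n) T (I_m (x) DFT_n) L step by step.
    Illustrative only: the radix m is chosen as the smallest factor of N.
    """
    N = len(x)
    m = smallest_factor(N)
    if m == N:                      # prime (or trivial) size: no split possible
        return dft_naive(x)
    n = N // m
    # L and I_m (x) DFT_n: read the input at stride m, transform each block.
    blocks = [fft_mixed_radix(x[j::m]) for j in range(m)]
    # T: scale entry k1 of block j by the twiddle factor w_N^(j * k1).
    for j in range(m):
        for k1 in range(n):
            blocks[j][k1] *= cmath.exp(-2j * cmath.pi * j * k1 / N)
    # DFT_m (x) I_n: a size-m DFT across the blocks for every position k1.
    y = [0j] * N
    for k1 in range(n):
        col = dft_naive([blocks[j][k1] for j in range(m)])
        for k2 in range(m):
            y[k2 * n + k1] = col[k2]
    return y
```

Calling fft_mixed_radix on a length-12 input splits it as 2 · 2 · 3, and the result should agree with dft_naive up to floating-point rounding. In this form the stride-m reads and the per-position size-m DFTs are the steps that do not line up with contiguous v-wide vector loads, which is the kind of mismatch the restructured algorithm in the paper is meant to remove.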
