Discrete Fourier Transform on Multicores

This paper gives an overview of the techniques needed to implement the discrete Fourier transform (DFT) efficiently on current multicore systems. The focus is on Intel-compatible multicores, but we also discuss the IBM Cell and, briefly, graphics processing units (GPUs). The performance optimization is broken down into three key challenges: parallelization, vectorization, and memory hierarchy optimization. In each case, we use the Kronecker product formalism to derive the necessary algorithmic transformations from a few hardware parameters. Further code-level optimizations are also discussed. The rigorous nature of this framework enables complete automation of the implementation task, as shown by the program generator Spiral. Finally, we present and analyze DFT benchmarks of the fastest libraries available for the considered platforms.
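To illustrate what the Kronecker product formalism looks like in practice, the sketch below checks the well-known Cooley-Tukey factorization DFT_mn = (DFT_m ⊗ I_n) T (I_m ⊗ DFT_n) L against a naive DFT. It is a minimal illustration and not the paper's or Spiral's code: the function names, the sign convention exp(-2πi jk/N), and the use of a naive DFT as the small kernel are all assumptions made for this example.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT (assumed convention: exp(-2*pi*i*j*k/n))."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def stride_perm(x, m, n):
    """Stride permutation L: gather x at stride m into m contiguous blocks of length n."""
    return [x[i * m + j] for j in range(m) for i in range(n)]

def i_kron_dft(x, m, n):
    """(I_m (x) DFT_n): DFT_n on each of the m contiguous blocks.
    The blocks are independent, which is the structure parallelization exploits."""
    out = []
    for j in range(m):
        out.extend(dft(x[j * n:(j + 1) * n]))
    return out

def twiddle(x, m, n):
    """Diagonal twiddle matrix T: scale entry (j, k) by w^(j*k), w = exp(-2*pi*i/(m*n))."""
    size = m * n
    return [x[j * n + k] * cmath.exp(-2j * cmath.pi * j * k / size)
            for j in range(m) for k in range(n)]

def dft_kron_i(x, m, n):
    """(DFT_m (x) I_n): DFT_m on the n interleaved subvectors at stride n.
    Neighboring subvectors are processed identically, the structure vectorization exploits."""
    out = [0j] * (m * n)
    for k in range(n):
        out[k::n] = dft(x[k::n])
    return out

def cooley_tukey(x, m, n):
    """Apply the four factors right to left: L, then I (x) DFT_n, then T, then DFT_m (x) I."""
    return dft_kron_i(twiddle(i_kron_dft(stride_perm(x, m, n), m, n), m, n), m, n)

def max_error(m, n):
    """Largest absolute deviation between the factored and the naive DFT of size m*n."""
    x = [complex(t % 5, -(t % 3)) for t in range(m * n)]
    return max(abs(a - b) for a, b in zip(cooley_tukey(x, m, n), dft(x)))

if __name__ == "__main__":
    print(max_error(4, 2), max_error(3, 5))
```

Each helper corresponds to one matrix factor, so manipulating the formula (for example, regrouping the tensor products to match a core count or vector length) corresponds to restructuring the loops.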
