Discrete Fourier transform on multicore

This article gives an overview of the techniques needed to implement the discrete Fourier transform (DFT) efficiently on current multicore systems. The focus is on Intel-compatible multicores, but we also discuss the IBM Cell and, briefly, graphics processing units (GPUs). The performance optimization is broken down into three key challenges: parallelization, vectorization, and memory hierarchy optimization. In each case, we use the Kronecker product formalism to derive the necessary algorithmic transformations from a few hardware parameters. Further code-level optimizations are also discussed. The rigor of this framework enables complete automation of the implementation task, as shown by the program generator Spiral. Finally, we present and analyze DFT benchmarks of the fastest libraries available for the considered platforms.
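To make the Kronecker product formalism concrete, the following is a minimal sketch of one step of the textbook Cooley-Tukey factorization, DFT_mn = (DFT_m ⊗ I_n) T (I_m ⊗ DFT_n) L, where L is a stride permutation and T is a diagonal twiddle matrix. It is an illustration under our own assumptions, not code from the article or from Spiral: the function names `dft`, `cooley_tukey_step`, and `max_error` are hypothetical, and the hardware-specific variants the article derives may differ from this basic form. In this formalism, a factor I_m ⊗ A is commonly read as m independent applications of A to contiguous blocks (a natural fit for parallel cores), while A ⊗ I_n applies A at stride n (a natural fit for vector instructions).

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT by definition; used as reference and as the small kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def cooley_tukey_step(x, m, n):
    """One Cooley-Tukey step for size m*n, written factor by factor:
    DFT_mn = (DFT_m (x) I_n) * T * (I_m (x) DFT_n) * L."""
    N = m * n
    assert len(x) == N

    # L: stride permutation. Block i collects x[i], x[i+m], x[i+2m], ...
    y = [x[j * m + i] for i in range(m) for j in range(n)]

    # I_m (x) DFT_n: m independent DFTs of size n on contiguous blocks.
    z = []
    for i in range(m):
        z.extend(dft(y[i * n:(i + 1) * n]))

    # T: diagonal twiddle factors w_N^(i*k) for block i, position k.
    t = [z[i * n + k] * cmath.exp(-2j * cmath.pi * i * k / N)
         for i in range(m) for k in range(n)]

    # DFT_m (x) I_n: n interleaved DFTs of size m, each at stride n.
    out = [0j] * N
    for k in range(n):
        col = dft(t[k::n])
        for l in range(m):
            out[l * n + k] = col[l]
    return out

def max_error(a, b):
    """Largest absolute difference between two equal-length sequences."""
    return max(abs(u - v) for u, v in zip(a, b))

if __name__ == "__main__":
    x = [complex(i, (i * i) % 5) for i in range(12)]
    print(max_error(cooley_tukey_step(x, 3, 4), dft(x)))
```

Running the script prints the largest deviation between the factored computation and the direct definition, which should be at the level of floating-point rounding. The two loops over `i` and `k` have no dependencies between iterations, which is the property that parallelization and vectorization rewrites exploit.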
