MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPU, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves comparable performance as the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two size. For 3D non-power-of-two FFTs, our library delivers 1.5x to 28x faster than FFTW with 4 threads and 20.01x average speedup over CUFFT 4.0 on Tesla C2050.

[1]  R. W. Johnson,et al.  A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures , 1990 .

[2]  Budirijanto Purnomo,et al.  ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs , 2010, SIGGRAPH '10.

[3]  Jie Cheng,et al.  Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..

[4]  C. Loan Computational Frameworks for the Fast Fourier Transform , 1992 .

[5]  Naga K. Govindaraju,et al.  Auto-tuning of fast fourier transform on graphics processors , 2011, PPoPP '11.

[6]  Dragan Mirkovic,et al.  Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms , 2007 .

[7]  Naga K. Govindaraju,et al.  High performance discrete Fourier transforms on graphics processors , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Satoshi Matsuoka,et al.  Auto-tuning 3-D FFT library for CUDA GPUs , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[9]  Liang Gu,et al.  An empirically tuned 2D and 3D FFT library on CUDA GPU , 2010, ICS '10.

[10]  David Kaeli,et al.  Heterogeneous Computing with OpenCL , 2011 .

[11]  Xipeng Shen,et al.  Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.

[12]  Paul N. Swarztrauber,et al.  Multiprocessor FFTs , 1987, Parallel Comput..

[13]  Franz Franchetti,et al.  Discrete fourier transform on multicore , 2009, IEEE Signal Processing Magazine.

[14]  Wu-chun Feng,et al.  Power and Performance Characterization of Computational Kernels on the GPU , 2010, 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing.

[15]  Markus Püschel,et al.  Offline library adaptation using automatically generated heuristics , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[16]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[17]  Yi Yang,et al.  A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.

[18]  R. Tolimieri,et al.  Algorithms for Discrete Fourier Transform and Convolution , 1989 .

[19]  Martin Vetterli,et al.  Fast Fourier transforms: a tutorial review and a state of the art , 1990 .

[20]  Dragan Mirkovic,et al.  Automatic Performance Tuning in the UHFFT Library , 2001, International Conference on Computational Science.

[21]  Matteo Frigo,et al.  A fast Fourier transform compiler , 1999, SIGP.

[22]  David Padua,et al.  Encyclopedia of Parallel Computing , 2011 .

[23]  William Gropp,et al.  An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.

[24]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[25]  Feng Ji,et al.  Using Shared Memory to Accelerate MapReduce on Graphics Processing Units , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[26]  Timothy G. Mattson,et al.  OpenCL Programming Guide , 2011 .

[27]  Dragan Mirkovic,et al.  An adaptive software library for fast Fourier transforms , 2000, ICS '00.