Auto-tuning 3-D FFT library for CUDA GPUs

Existing FFT implementations on GPUs are optimized for specific transform sizes, such as powers of two, and exhibit unstable, peaky performance: they do not perform as well on the other sizes that appear in practice. Our new auto-tuning 3-D FFT library for CUDA alleviates this problem by generating high-performance CUDA kernels for FFTs of varying transform sizes. Although auto-tuning has been applied on GPUs to dense kernels such as DGEMM and stencil computations, this is the first time it has been applied comprehensively to a bandwidth-intensive and complex kernel such as the 3-D FFT. Bandwidth-intensive optimizations, such as selecting the number of threads and inserting padding into shared memory to avoid bank conflicts, are applied systematically. The resulting auto-tuner is fast, delivers performance that essentially beats all single-processor 3-D FFT implementations to date, and exhibits stable performance irrespective of problem size or the underlying GPU hardware.
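
As an illustration of the shared-memory padding optimization mentioned in the abstract, the sketch below shows the classic padded-tile pattern in CUDA. It is not the authors' kernel; the tile size, the PAD constant, and the kernel name transpose_padded are assumptions chosen for clarity. An auto-tuner of the kind described would search over parameters such as the padding amount and the thread-block shape rather than fixing them at compile time.

    // Sketch: transpose a square n-by-n array through shared memory.
    // The +PAD column shifts consecutive rows of the tile into different
    // banks, so the column-wise reads after the barrier are conflict-free.
    // Assumes n is a multiple of TILE; all values here are illustrative only.
    #include <cuda_runtime.h>

    #define TILE 32
    #define PAD  1   // an auto-tuner would treat this as a tunable parameter

    __global__ void transpose_padded(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + PAD];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
        __syncthreads();

        // swap the block indices so the store is also coalesced
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
    }

The same padding idea applies when staging FFT sub-transforms in shared memory, where the access strides of the butterflies would otherwise map many threads of a warp onto the same bank.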
