Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

Most GPU performance "hype" has focused on tightly coupled applications with small memory bandwidth requirements, e.g., N-body simulations. GPUs, however, are also commodity vector machines with substantial memory bandwidth, and effective methodologies for programming them as such have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, more than three times faster than any existing FFT implementation on GPUs, including CUFFT. Careful programming techniques are employed to fully exploit the characteristics of modern GPU hardware while overcoming its limitations: using on-chip shared memory, optimizing the number of threads and registers through appropriate localization, and avoiding slow strided memory accesses. Applied to real applications, our kernel improves power- and cost-performance metrics by orders of magnitude. The off-card bandwidth limitation remains an issue; it can be alleviated somewhat by confining application kernels to the card, although the ideal solution is faster GPU interfaces.
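
To make the concern about strided accesses concrete, the following minimal Python sketch computes the byte distance between consecutive elements along each axis of a 3-D array, together with a GFLOPS figure derived from a run time. Everything in it is an illustrative assumption rather than a detail taken from the abstract: the 256^3 problem size, the 8-byte single-precision complex element, the x-fastest memory layout, the widely used 5 N log2 N operation count for an N-point complex FFT, and the helper names `axis_strides_bytes` and `fft3d_gflops`.

```python
import math


def axis_strides_bytes(nx, ny, nz, elem_bytes=8):
    """Byte distance between consecutive elements along each axis.

    Assumes x is the fastest-varying index: offset = x + nx * (y + ny * z).
    elem_bytes=8 assumes single-precision complex values (hypothetical choice).
    nz does not affect any stride; it is accepted only for a symmetric signature.
    """
    return {
        "x": elem_bytes,            # contiguous: consecutive elements are adjacent
        "y": elem_bytes * nx,       # strided: skips one full row per step
        "z": elem_bytes * nx * ny,  # strided: skips one full plane per step
    }


def fft3d_gflops(n, seconds):
    """GFLOPS of an n x n x n complex FFT under the 5 * N * log2(N) convention."""
    total_points = n ** 3
    flops = 5 * total_points * math.log2(total_points)
    return flops / seconds / 1e9


if __name__ == "__main__":
    print(axis_strides_bytes(256, 256, 256))
    # Run time that would correspond to 80 GFLOPS at this size, under the
    # assumed operation count.
    t = 5 * 256 ** 3 * math.log2(256 ** 3) / 80e9
    print(fft3d_gflops(256, t))
```

Under these assumptions, a 1-D transform along x reads adjacent elements, whereas transforms along y and z step through memory in strides of 2 KiB and 512 KiB respectively, which is the kind of access pattern the kernel is designed to avoid.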
