Large-scale FFT on GPU clusters

A GPU cluster is a cluster whose nodes are equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g., matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g., finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate, because the bottleneck often lies in the PCI bus between main memory and GPU device memory or in the communication network between workstation nodes. Optimizing FFT performance on a single GPU device therefore does not by itself improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance: first, the use of GPU devices improves the sustained memory bandwidth for processing large data sets; second, GPU device memory allows larger subtasks to be processed whole, which reduces repeated data transfers between memory and processors; and third, some costly main-memory operations such as matrix transposition can be sped up significantly by GPUs if the necessary data rearrangement is performed during data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors, together with the improved communication library in our implementation, contribute to a 24.3x speedup over FFTW and a 7x speedup over Intel MKL for 4096 3D single-precision FFT on a 16-node cluster with 32 GPUs. A speedup of around 5x over both standard libraries is achieved for double precision.
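
The idea of rearranging array dimensions during a data transfer can be illustrated with a small sketch. The code below is not the paper's implementation: it is a pure-Python toy in which the hypothetical function transfer_transposed stands in for a host-to-device or device-to-host copy that swaps the two dimensions as it copies, so that the multi-dimensional FFT never needs a separate transposition pass over main memory. All function names are assumptions made for illustration, and the radix-2 FFT requires power-of-two lengths.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transfer_transposed(src, rows, cols):
    """Hypothetical stand-in for a memory transfer that swaps dimensions.

    Reads a row-major rows x cols array and writes it as a row-major
    cols x rows array, i.e. the transpose is folded into the copy.
    """
    dst = [0j] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            dst[c * rows + r] = src[r * cols + c]
    return dst


def fft_rows(buf, rows, cols):
    """Apply a 1D FFT to each contiguous row of a row-major array."""
    out = []
    for r in range(rows):
        out.extend(fft(buf[r * cols:(r + 1) * cols]))
    return out


def fft2d(data, rows, cols):
    """2D FFT as two passes of row FFTs joined by transposing transfers."""
    a = fft_rows(data, rows, cols)          # FFT along the contiguous axis
    b = transfer_transposed(a, rows, cols)  # copy out, dimensions swapped
    c = fft_rows(b, cols, rows)             # former columns are now rows
    return transfer_transposed(c, cols, rows)  # copy back, layout restored


def dft2d_reference(data, rows, cols):
    """Direct O(N^2) 2D DFT, used only to check the sketch."""
    out = [0j] * (rows * cols)
    for u in range(rows):
        for v in range(cols):
            s = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = -2 * cmath.pi * (u * r / rows + v * c / cols)
                    s += data[r * cols + c] * cmath.exp(1j * angle)
            out[u * cols + v] = s
    return out
```

In this sketch every FFT pass works on contiguous rows, and the only strided accesses occur inside the copy. On a real system the copy would presumably be a strided transfer over the bus rather than a Python loop; which transfer primitive serves that role is left open here.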
