Implementation of Parallel 1-D FFT on GPU Clusters

In this paper, we propose an implementation of a parallel one-dimensional fast Fourier transform (FFT) on GPU clusters, based on the six-step FFT algorithm. Because the parallel one-dimensional FFT requires three all-to-all communications, a key goal for parallel FFTs on GPU clusters is to minimize both the PCI Express transfer time and the MPI communication time. We demonstrate that the advanced features of MVAPICH2-GPU make it easy to overlap PCI Express transfers with MPI communication. We report performance results for one-dimensional FFTs on a GPU cluster: for a 2^34-point FFT, we achieved over 763 GFlops on 128 nodes of the HA-PACS (268 nodes, 2.99 TFlops/node, 802 TFlops peak performance).
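As an illustration of the six-step structure mentioned above, the following is a minimal single-process sketch in Python. It is not the GPU/MPI implementation described in this paper: the names `dft`, `transpose`, and `six_step_fft` are hypothetical, and a naive O(n^2) DFT stands in for a tuned local FFT kernel. The length N is assumed to factor as n1 * n2. In a distributed-memory setting, each of the three transposes would typically become one all-to-all exchange, which is where the three all-to-all communications come from.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT; a stand-in for a tuned local FFT kernel."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """Six-step FFT of a sequence of length n1 * n2 (illustrative sketch)."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n2-by-n1 row-major matrix: rows[j2][j1] = x[j1 + j2 * n1].
    rows = [list(x[j2 * n1:(j2 + 1) * n1]) for j2 in range(n2)]
    # Step 1: transpose to n1-by-n2.
    a = transpose(rows)
    # Step 2: n1 independent DFTs of length n2.
    b = [dft(row) for row in a]
    # Step 3: multiply by the twiddle factors exp(-2*pi*i * j1 * k2 / n).
    b = [
        [b[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n) for k2 in range(n2)]
        for j1 in range(n1)
    ]
    # Step 4: transpose to n2-by-n1.
    c = transpose(b)
    # Step 5: n2 independent DFTs of length n1.
    d = [dft(row) for row in c]
    # Step 6: transpose to n1-by-n2; the row-major flattening is the result.
    y = transpose(d)
    return [value for row in y for value in row]
```

For a small input, the result can be checked against a direct transform, for example by comparing `six_step_fft(x, 4, 8)` with `dft(x)` for a 32-element `x` and confirming that they agree to within floating-point rounding.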
