An Implementation of Parallel 3-D FFT with 2-D Decomposition on a Massively Parallel Cluster of Multi-core Processors

In this paper, we propose an implementation of a parallel three-dimensional fast Fourier transform (FFT) with two-dimensional decomposition on a massively parallel cluster of multi-core processors. The proposed parallel three-dimensional FFT algorithm is based on the multicolumn FFT algorithm. We show that a two-dimensional decomposition effectively improves performance by reducing the communication time for larger numbers of MPI processes. We successfully achieved a performance of over 401 GFlops on 256 nodes of Appro Xtreme-X3 (648 nodes, 147.2 GFlops/node, 95.4 TFlops peak performance) for 256^3-point FFT.
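
To make the data movement behind a two-dimensional decomposition concrete, the following is a minimal serial sketch in Python with NumPy. It is an illustration under stated assumptions, not the implementation described in this paper: it does not use MPI, it does not reproduce the multicolumn FFT algorithm, and the names `split`, `fft3d_pencil`, `py`, and `pz` are hypothetical. A dictionary keyed by (i, j) stands in for a py x pz grid of processes, and each concatenation stands in for an all-to-all exchange that involves only the py (or pz) processes of one row (or column) of that grid rather than all py * pz processes.

```python
import numpy as np


def split(n, p, r):
    """Slice of an axis of length n owned by rank r out of p (assumes p divides n)."""
    b = n // p
    return slice(r * b, (r + 1) * b)


def fft3d_pencil(a, py, pz):
    """Serially simulate a 3-D FFT of a cubic array on a py x pz process grid."""
    n = a.shape[0]

    # Initial layout, "x-pencils": rank (i, j) owns all of x,
    # the i-th block of y, and the j-th block of z.
    xpen = {(i, j): a[:, split(n, py, i), split(n, pz, j)].astype(complex)
            for i in range(py) for j in range(pz)}

    # Step 1: 1-D FFTs along x need no communication.
    xpen = {rank: np.fft.fft(blk, axis=0) for rank, blk in xpen.items()}

    # Step 2: exchange among the py ranks that share the same j,
    # giving "y-pencils" (x-block i, all of y, z-block j), then FFT along y.
    ypen = {}
    for j in range(pz):
        slab = np.concatenate([xpen[(i, j)] for i in range(py)], axis=1)
        for i in range(py):
            ypen[(i, j)] = np.fft.fft(slab[split(n, py, i), :, :], axis=1)

    # Step 3: exchange among the pz ranks that share the same i,
    # giving "z-pencils" (x-block i, y-block j, all of z), then FFT along z.
    zpen = {}
    for i in range(py):
        slab = np.concatenate([ypen[(i, j)] for j in range(pz)], axis=2)
        for j in range(pz):
            zpen[(i, j)] = np.fft.fft(slab[:, split(n, pz, j), :], axis=2)

    # Gather the distributed result into one array, only to allow checking.
    out = np.empty((n, n, n), dtype=complex)
    for (i, j), blk in zpen.items():
        out[split(n, py, i), split(n, pz, j), :] = blk
    return out


# Compare against NumPy's own 3-D FFT on a small random cube.
rng = np.random.default_rng(0)
cube = rng.standard_normal((8, 8, 8))
result = fft3d_pencil(cube, py=2, pz=4)
matches_reference = np.allclose(result, np.fft.fftn(cube))
```

In this layout each process must own at least one pencil, so a 256^3-point transform can be spread over up to 256 x 256 processes, whereas a one-dimensional (slab) decomposition of the same problem is limited to 256 processes, one plane each.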