PFFT: An Extension of FFTW to Massively Parallel Architectures

We present a software library for computing fast Fourier transforms (FFTs) on massively parallel, distributed memory architectures based on the Message Passing Interface (MPI) standard. Following established transpose FFT algorithms, we propose a parallel FFT framework built from a combination of local FFTs, local data permutations, and global data transpositions. This framework generalizes to arbitrary multidimensional data and process meshes. All performance-relevant building blocks can be implemented with the help of the FFTW software library, so our library offers great flexibility and portable performance. Like FFTW, it computes FFTs of complex data, real data, and even- or odd-symmetric real data. All transforms can be performed completely in place. Furthermore, we propose an algorithm to calculate pruned FFTs more efficiently on distributed memory architectures. For example, we provide performance measurements of FFTs of sizes between $512^3$ ...
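To make the three building blocks concrete, the following is a minimal sketch of a two-dimensional transpose FFT on a one-dimensional process mesh. It is not the library's API: all function names (`dft`, `split_rows`, `global_transpose`, `transpose_fft2d`) are hypothetical, the processes are simulated by Python lists instead of MPI ranks, and a naive $O(n^2)$ DFT stands in for the local FFTW calls. The sketch assumes both dimensions are divisible by the number of processes.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT of a 1D sequence (stand-in for a local FFT call)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def split_rows(a, p):
    """Distribute the rows of matrix `a` block-wise over p simulated processes."""
    rows = len(a) // p
    return [a[r * rows:(r + 1) * rows] for r in range(p)]


def global_transpose(blocks):
    """Simulated all-to-all exchange followed by a local permutation.

    Before: process r holds a block of rows with all columns.
    After:  process s holds a block of the original columns, stored as rows.
    """
    p = len(blocks)
    cols = len(blocks[0][0]) // p
    # send[r][s] is the sub-block that process r sends to process s.
    send = [
        [[row[s * cols:(s + 1) * cols] for row in blocks[r]] for s in range(p)]
        for r in range(p)
    ]
    out = []
    for s in range(p):
        # Local permutation: reorder the received pieces so that each
        # original column becomes one contiguous local row.
        local = [
            [send[r][s][i][c] for r in range(p) for i in range(len(blocks[r]))]
            for c in range(cols)
        ]
        out.append(local)
    return out


def transpose_fft2d(a, p):
    """2D DFT of `a` on p simulated processes; returns the TRANSPOSED result."""
    n0, n1 = len(a), len(a[0])
    if n0 % p or n1 % p:
        raise ValueError("both dimensions must be divisible by p in this sketch")
    blocks = split_rows(a, p)
    # Step 1: local 1D transforms along the locally complete dimension.
    blocks = [[dft(row) for row in block] for block in blocks]
    # Step 2: global transposition so the other dimension becomes local.
    blocks = global_transpose(blocks)
    # Step 3: local 1D transforms along the newly local dimension.
    blocks = [[dft(row) for row in block] for block in blocks]
    # Gather the distributed blocks only for inspection.
    return [row for block in blocks for row in block]


def dft2d(a):
    """Serial reference: transform rows, then columns."""
    rows = [dft(r) for r in a]
    cols = [dft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

In this sketch the output is left in transposed order, which avoids a second global transposition; whether to transpose back is a design choice that depends on how the caller consumes the result. For a 4 x 6 input on two simulated processes, `transpose_fft2d(a, 2)[k1][k0]` agrees with `dft2d(a)[k0][k1]` up to rounding error.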
