AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

We present a new library for parallel distributed Fast Fourier Transforms (FFT). The importance of FFT in science and engineering and the advances in high-performance computing necessitate further improvements to parallel FFT software. AccFFT extends existing FFT libraries for CUDA-enabled Graphics Processing Units (GPUs) to distributed-memory clusters. We use an overlapping communication method to reduce the overhead of PCIe transfers to and from the GPU. We present numerical results on the Maverick platform at the Texas Advanced Computing Center (TACC) and on the Titan system at the Oak Ridge National Laboratory (ORNL). We report the scaling of the library on up to 4,096 K20 GPUs of Titan.
