Scalable multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer

For scalable 3-D FFT computation on multiple GPUs, efficient all-to-all communication between GPUs is the most important factor in achieving good performance. Implementations built on point-to-point MPI library functions and CUDA memory copy APIs typically exhibit very large overheads, especially for the small message sizes that arise in all-to-all communication among many nodes. We propose several schemes to minimize these overheads, including the use of a lower-level InfiniBand API to effectively overlap intra- and inter-node communication, as well as auto-tuning strategies that control scheduling and determine rail assignments. As a result, we achieve very good strong scalability as well as good performance: up to 4.8 TFLOPS in double precision using 256 nodes (768 GPUs) of the TSUBAME 2.0 supercomputer.
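
To illustrate why small messages become a concern under strong scaling, the following sketch estimates the size of each point-to-point message in the all-to-all transpose of a distributed 3-D FFT. It is not taken from the paper: it assumes a slab (1-D) decomposition, one rank per GPU, double-precision complex data (16 bytes per element), and a hypothetical 512^3 grid. The function names are illustrative only.

```python
BYTES_PER_COMPLEX_DOUBLE = 16  # two 8-byte floats per element


def alltoall_message_bytes(n: int, ranks: int) -> int:
    """Bytes one rank sends to one peer when transposing an n^3 grid.

    Assumption: slab (1-D) decomposition with n divisible by ranks.
    """
    if n % ranks != 0:
        raise ValueError("n must be divisible by ranks")
    local_elements = n**3 // ranks               # elements held by one rank
    per_peer_elements = local_elements // ranks  # slice destined for each peer
    return per_peer_elements * BYTES_PER_COMPLEX_DOUBLE


def alltoall_message_count(ranks: int) -> int:
    """Total point-to-point messages in one all-to-all (self-copies excluded)."""
    return ranks * (ranks - 1)


if __name__ == "__main__":
    n = 512  # hypothetical grid size, not a figure from the paper
    for ranks in (4, 16, 64, 256):
        size = alltoall_message_bytes(n, ranks)
        count = alltoall_message_count(ranks)
        print(f"ranks={ranks:4d}  message={size:>11d} bytes  messages={count}")
```

Under these assumptions the per-message size falls with the square of the rank count while the number of messages grows with it, so fixed per-message costs take up a growing share of the communication time as ranks are added.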
