A framework for low-communication 1-D FFT

In high-performance computing on distributed-memory systems, communication often accounts for a significant part of the overall execution time. The relative cost of communication will continue to rise as compute density grows in line with current technology and industry trends. The design of lower-communication alternatives to fundamental computational algorithms has therefore become an important field of research. For distributed 1-D FFT, communication cost has hitherto remained high because all industry-standard implementations perform three all-to-all internode data exchanges (also called global transposes). These communication steps dominate execution time. In this paper, we present a mathematical framework from which many easy-to-implement 1-D FFT algorithms that require only a single all-to-all exchange can be derived. For large-scale problems, our implementation can be twice as fast as leading FFT libraries on state-of-the-art computer clusters. Moreover, our framework allows a tradeoff between accuracy and performance, further boosting performance when reduced accuracy is acceptable.
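To make the three global transposes concrete, the following is a minimal single-process sketch of the classical transpose-based ("six-step") factorization of a length N = n1 * n2 transform. It is an illustration under stated assumptions, not the framework of this paper and not the code of any particular library: the names `dft`, `transpose`, and `six_step_fft` are hypothetical, the short local transforms use a naive DFT for brevity, and each call to `transpose` merely stands in for what would be one all-to-all exchange on a distributed-memory machine.

```python
import cmath

def dft(v):
    """Naive O(n^2) DFT; stands in for the short, node-local FFTs."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(m):
    """Placeholder for one all-to-all exchange (global transpose)."""
    return [list(col) for col in zip(*m)]

def six_step_fft(x, n1, n2):
    """Length n1*n2 DFT of x via the transpose-based factorization."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1-by-n2 matrix: a[j1][j2] = x[n2*j1 + j2].
    a = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]
    a = transpose(a)                      # exchange 1: a[j2][j1]
    a = [dft(row) for row in a]           # n2 transforms of length n1: a[j2][k1]
    # Twiddle factors exp(-2*pi*i*j2*k1/n) couple the two stages.
    a = [[a[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
          for k1 in range(n1)] for j2 in range(n2)]
    a = transpose(a)                      # exchange 2: a[k1][j2]
    a = [dft(row) for row in a]           # n1 transforms of length n2: a[k1][k2]
    a = transpose(a)                      # exchange 3: a[k2][k1]
    # Row-major flattening gives output index k = k1 + n1*k2 (natural order).
    return [v for row in a for v in row]
```

In this sketch all arithmetic between the exchanges is local to a row, so the three `transpose` calls are the only steps that would move data between nodes; these are the exchanges that a single-all-to-all algorithm aims to reduce to one.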
