Four Easy Ways to a Faster FFT

The Fast Fourier Transform (FFT) was named one of the Top Ten algorithms of the 20th century, and it continues to be a focus of current research. A problem with currently used FFT packages is that they require large, finely tuned, machine-specific libraries produced by highly skilled software developers. As a result, these packages fail to perform well across a variety of architectures. Furthermore, many of them need to run repeated experiments in order to "re-program" their code for optimal performance on a given machine's underlying hardware. Finally, it is difficult to know which radix to use for a particular vector size and machine configuration. We propose the use of monolithic array analysis as a way to remove the constraints that a machine's underlying hardware imposes on performance, by pre-optimizing array access patterns. In doing so, we arrive at a single optimized program. On our experimental platforms, we have achieved up to a 99.6% increase in performance and the ability to run vectors up to 8,388,608 elements larger. Preliminary experiments indicate that different radices perform better or worse depending on a machine's underlying architecture.
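To make the terms "radix" and "array access pattern" concrete, the following is a minimal textbook sketch of an iterative radix-2 FFT in Python. It is not the monolithic array analysis proposed here, and the function names (`fft_radix2`, `bit_reverse_permute`, `dft_naive`) are illustrative assumptions. It only shows where the strided reads and writes occur that a choice of radix changes.

```python
import cmath


def bit_reverse_permute(x):
    """Return a copy of x reordered by bit-reversed index (len(x) a power of two)."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        # Reverse the low `bits` bits of i; a length-1 input has nothing to reverse.
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = x[i]
    return out


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT (textbook version, not tuned)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = bit_reverse_permute([complex(v) for v in x])
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        # Each stage touches elements `half` apart; this stride doubles per stage
        # and is the access pattern that interacts with the memory hierarchy.
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        size *= 2
    return a


def dft_naive(x):
    """O(n^2) reference DFT used only to check the FFT result."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

A higher radix (for example radix-4) would combine more elements per butterfly and run fewer stages, which changes the stride sequence. Under that assumption, it is plausible that the best radix depends on cache and register structure, as the preliminary experiments suggest.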
