Paper information - Four Easy Ways to a Faster FFT

Four Easy Ways to a Faster FFT

The Fast Fourier Transform (FFT) was named one of the Top Ten algorithms of the 20th century, and it continues to be a focus of current research. A problem with currently used FFT packages is that they require large, finely tuned, machine-specific libraries produced by highly skilled software developers. As a result, these packages fail to perform well across a variety of architectures. Furthermore, many of them must run repeated experiments in order to ‘re-program’ their code for optimal performance on a given machine's underlying hardware. Finally, it is difficult to know which radix to use for a particular vector size and machine configuration. We propose the use of monolithic array analysis as a way to remove the constraints that a machine's underlying hardware imposes on performance, by pre-optimizing array access patterns. In doing this we arrive at a single optimized program. On our experimental platforms, we have achieved up to a 99.6% increase in performance and the ability to run vectors up to 8 388 608 elements larger. Preliminary experiments indicate that which radix performs best depends on a machine's underlying architecture.

Sharon G. Small | Lenore M. Restifo Mullin
