Large-scale FFTs and convolutions on Apple hardware

High FFT performance at large signal lengths can be achieved with a matrix paradigm that exploits cache, memory, and multicore/multithreaded execution. Each of the large-scale FFT implementations we report herein is built hierarchically on the very fast small-length FFTs of the standard OS X Accelerate library. (The hierarchical ideas should apply equally well to low-level FFTs of, say, the OpenCL/GPU variety.) By building on such established, packaged, small-length FFTs, one can achieve sustained processing rates in the multi-gigaflop/s region on a single Apple machine, even for signal lengths into the billions.
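To make the matrix paradigm concrete, the following is a minimal sketch of the classic "four-step" factorization, in which a length-N transform with N = rows × cols is computed entirely from shorter FFTs. It is an illustration under stated assumptions, not the implementation reported here: NumPy's `np.fft.fft` stands in for the small-length Accelerate FFTs, the name `four_step_fft` and its `rows`/`cols` parameters are hypothetical, and no attempt is made to model cache blocking or multithreading.

```python
import numpy as np


def four_step_fft(x, rows, cols):
    """Length rows*cols complex FFT built from shorter FFTs (illustrative sketch).

    Assumption: np.fft.fft plays the role of the packaged small-length FFT.
    """
    x = np.asarray(x, dtype=complex)
    n = rows * cols
    if x.shape != (n,):
        raise ValueError("input length must equal rows * cols")

    # View the signal as a matrix: a[n2, n1] = x[n1 + rows * n2].
    a = x.reshape(cols, rows)

    # Step 1: length-`cols` FFTs along axis 0 (over n2), one per column.
    b = np.fft.fft(a, axis=0)

    # Step 2: multiply by twiddle factors exp(-2*pi*i * k2 * n1 / n).
    k2 = np.arange(cols).reshape(cols, 1)
    n1 = np.arange(rows).reshape(1, rows)
    b *= np.exp(-2j * np.pi * k2 * n1 / n)

    # Step 3: length-`rows` FFTs along axis 1 (over n1), one per row.
    d = np.fft.fft(b, axis=1)

    # Step 4: transpose so that output index k = cols * k1 + k2 is contiguous.
    return d.T.reshape(n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(1 << 12) + 1j * rng.standard_normal(1 << 12)
    result = four_step_fft(signal, rows=64, cols=64)
    print(np.allclose(result, np.fft.fft(signal)))
```

In this sketch every short FFT in steps 1 and 3 is independent of the others, which is what makes the decomposition attractive for cache-sized work units and for distribution across threads; the transpose in step 4 is where memory-access cost tends to concentrate.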
