AutoFFT: a template-based FFT codes auto-generation framework for ARM and X86 CPUs

The discrete Fourier transform (DFT) is widely used in scientific and engineering computation. This paper proposes AutoFFT, a template-based code generation framework that automatically generates high-performance fast Fourier transform (FFT) code. AutoFFT uses the Cooley-Tukey FFT algorithm, which exploits the symmetric and periodic properties of the DFT matrix, as its outer parallelization framework. To further reduce the number of floating-point operations in the butterflies, we exploit additional symmetric and periodic properties of the DFT matrix and formulate two optimized calculation templates, one for prime radices and one for power-of-two radices. To fully exploit hardware resources, we encapsulate a series of optimizations in an assembly template optimizer. Given any DFT problem, AutoFFT automatically generates C FFT kernels from these two templates and translates them into efficient assembly code using the template optimizer. Experiments show that AutoFFT outperforms FFTW, ARMPL, and Intel MKL on average across all FFT types on ARMv8 and Intel x86-64 processors.
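As a rough illustration of how the Cooley-Tukey algorithm uses the periodicity and symmetry of the DFT matrix, the following minimal sketch implements a textbook recursive radix-2 FFT and checks it against a direct DFT. It is not AutoFFT's generated code or either of its templates; the function names `naive_dft` and `fft_radix2` are hypothetical and chosen only for this example, and the sketch assumes a power-of-two input length.

```python
import cmath


def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Textbook recursive radix-2 Cooley-Tukey FFT (illustrative only).

    Assumes len(x) is a power of two; this is a restriction of the sketch,
    not of the framework described in the abstract.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)

    # Periodicity: the even- and odd-indexed samples each form a DFT of
    # size n/2 whose outputs repeat with period n/2.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])

    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Symmetry: the twiddle factor for index k + n/2 is the negative of
        # the one for index k, so one multiplication serves two outputs.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t          # butterfly, upper output
        out[k + half] = even[k] - t   # butterfly, lower output
    return out


if __name__ == "__main__":
    samples = [complex(i, -i) for i in range(8)]
    fast = fft_radix2(samples)
    slow = naive_dft(samples)
    # Largest element-wise difference between the two results.
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

Each loop iteration is one radix-2 butterfly. The optimized templates mentioned in the abstract address larger prime and power-of-two radices, which this sketch does not attempt to reproduce.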
